Preface


The purpose of this book is to establish a mathematical theory of Bayesian 
statistics. 

In practical applications of Bayesian statistical inference, we need to prepare a statistical model and a prior for a given sample, then estimate the unknown true distribution. One of the most important problems is how to construct a pair of a statistical model and a prior when we do not know the true distribution. An answer to this problem based on mathematical theory is given by the following procedure.

(1) First, we construct universal mathematical laws between Bayesian observables which hold for an arbitrary triple of a true distribution, a statistical model, and a prior.

(2) Second, by using such laws, we evaluate how appropriate a pair of a statistical model and a prior is for the unknown true distribution.

(3) Last, the most suitable pair of a statistical model and a prior is employed.

The conventional approach to this purpose has been based on the assumption that the posterior distribution can be approximated by some normal distribution. However, the new statistical theory introduced in this book holds for an arbitrary posterior distribution, so its field of application is wider. The author also expects that new statistical methodology which enables us to manipulate complex and hierarchical statistical models, such as normal mixtures or hierarchical neural networks, will be based on the new mathematical theory.


Sumio Watanabe 


Taylor & Francis 
Taylor & Francis Group 


http://taylorandfrancis.com 


Chapter 1 


Definition of Bayesian Statistics


In the first chapter, we introduce basic concepts in Bayesian statistics. In this book, we assume that there exists an unknown true distribution or unknown information source from which random variables are generated. We also assume that an arbitrary set of a statistical model and a prior is prepared by a statistician who does not know the true distribution. Hence, from the mathematical point of view, the theory proposed in this book holds for an arbitrary set of a true distribution, a statistical model, and a prior. The contents of this chapter include:

(1) In statistical estimation, we prepare a statistical model and a prior, though a true distribution is unknown. Hence evaluation of a statistical model and a prior is necessary.

(2) Several examples of probability distributions are introduced.

(3) It is assumed that an information source is represented by a probability distribution, which is called a true distribution.

(4) The posterior and the predictive distributions are defined for a given statistical model and a prior.

(5) Two examples of posterior distributions are illustrated. In a simple estimation problem, the posterior distribution can be approximated by a normal distribution, whereas in a complex or hierarchical model, the result is far from any normal distribution. In this book we establish Bayesian theory which holds for both cases.

(6) The generalization loss is estimated by the cross validation loss and the widely applicable information criterion (WAIC).

(7) The marginal likelihood and the free energy of statistical estimation are introduced.

(8) Statistical estimation in a conditional independent case is studied, in which the cross validation loss cannot be used but WAIC can be.

For readers who are new to probability theory, chapter 10 will be helpful. 


1.1 Bayesian Statistics 


In this book, we assume that a sample $\{x_1, x_2, \ldots, x_n\}$ is taken from some true probability distribution $q(x)$,

$$x_1, x_2, \ldots, x_n \sim q(x).$$

This process is represented by a conditional probability distribution of a sample $\{x_1, x_2, \ldots, x_n\}$ for a given true distribution $q(x)$,

$$P(x_1, x_2, \ldots, x_n \mid q).$$

If we knew both $P(x_1, x_2, \ldots, x_n \mid q)$ and $P(q)$, where $P(q)$ is an a priori probability distribution of a true distribution, then by Bayes' theorem,

$$P(q \mid x_1, x_2, \ldots, x_n) = \frac{P(x_1, x_2, \ldots, x_n \mid q)\,P(q)}{\sum_{q} P(x_1, x_2, \ldots, x_n \mid q)\,P(q)},$$

which would give the statistical inference of $q(x)$ from a sample $\{x_1, x_2, \ldots, x_n\}$. However, in the real world, we do not have any information about either of them, so $P(q \mid x_1, x_2, \ldots, x_n)$ cannot be obtained.

A problem whose answer cannot be uniquely determined because of the 
lack of the information is called an ill-posed problem. Statistical inferences 
in the real world are ill-posed. In an ill-posed problem, we cannot deter- 
mine a uniquely optimal method by which a correct answer is automatically 
obtained, which leads us to propose a new way: 


Choose method → Result → Evaluate chosen method.


It might seem that such an evaluation is impossible because we do not 
have any information about the true distribution. However, in Bayesian 
statistics, there are mathematical laws which hold for an arbitrary set of a 
true distribution, a statistical model, and a prior. By using formulas derived 
from the mathematical laws, we can evaluate the appropriateness of the set 
of a statistical model and a prior even if a true distribution is unknown. The 
purpose of this book is to establish such mathematical laws.

In Bayesian statistics, we set a statistical model $p(x|w)$ and a prior $\varphi(w)$, where $p(x|w)$ is a conditional probability density of $x$ for a given parameter $w$ and $\varphi(w)$ is a probability density of $w$. Both $p(x|w)$ and $\varphi(w)$ are prepared by a statistician who does not know the true distribution. Hence they may be quite different from or inappropriate for the true distribution. If a sample $\{x_1, x_2, \ldots, x_n\}$ consists of independent sample points from a true distribution $q(x)$, then the posterior probability density function of $w$ is defined by

$$p(w \mid x_1, x_2, \ldots, x_n) = \frac{\prod_{i=1}^{n} p(x_i|w)\,\varphi(w)}{\int \prod_{i=1}^{n} p(x_i|w)\,\varphi(w)\,dw}.$$

This is the definition of the posterior distribution. The estimated probability density function of $x$ is defined by

$$\hat{p}(x) = \int p(x|w)\,p(w \mid x_1, x_2, \ldots, x_n)\,dw.$$

This is also the definition of the predictive distribution. A statistician estimates the unknown $q(x)$ by $\hat{p}(x)$. For an arbitrary triple $(q(x), p(x|w), \varphi(w))$, we can define the Bayesian inference by this procedure; however, we need to examine whether a statistical model and a prior are appropriate for the unknown true distribution. Figure 1.1 shows the process of Bayesian estimation.
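The posterior and predictive definitions above can be sketched numerically on a parameter grid. This is only a minimal illustration under stated assumptions: the model $p(x|w)$ (normal with mean $w$ and unit variance), the prior (normal with standard deviation 2), the assumed true density, the grid range, and all function names are hypothetical choices made for this example, not part of the text.

```python
import math
import random

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def grid_posterior(sample, grid, prior):
    """Posterior probabilities on a uniform grid of w (normalized to sum to 1)."""
    log_post = [
        math.log(prior(w)) + sum(math.log(normal_pdf(x, w, 1.0)) for x in sample)
        for w in grid
    ]
    top = max(log_post)  # subtract the maximum before exponentiating, for stability
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def predictive(x, grid, weights):
    """Predictive density: posterior average of p(x|w) over the grid."""
    return sum(wt * normal_pdf(x, w, 1.0) for w, wt in zip(grid, weights))

random.seed(0)
sample = [random.gauss(0.5, 1.0) for _ in range(20)]  # assumed true q(x) = N(0.5, 1)
grid = [-5.0 + 0.01 * k for k in range(1001)]          # w in [-5, 5], step 0.01
prior = lambda w: normal_pdf(w, 0.0, 2.0)              # assumed prior on w
weights = grid_posterior(sample, grid, prior)
post_mean = sum(w * wt for w, wt in zip(grid, weights))
print("posterior mean of w:", post_mean)
print("predictive density at x = 0.5:", predictive(0.5, grid, weights))
```

A grid is used here only because the parameter is one-dimensional; for this particular conjugate pair the posterior is also available in closed form, which makes the sketch easy to check.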


Remark 1. If we knew the true prior $\varphi_0(w)$ to which a parameter as a random variable is subject, and if a sample $\{x_1, x_2, \ldots, x_n\}$ was independently taken from the true conditional probability density $p_0(x|w)$, then the predictive distribution $\hat{p}(x)$ using $p_0(x|w)$ and $\varphi_0(w)$ would be the uniquely best inference. This is called the formal optimality of the Bayesian inference. See Section 9.1. However, in the real world, we do not know either $p_0(x|w)$ or $\varphi_0(w)$, so we need evaluation because $\hat{p}(x)$ may be quite different from $q(x)$.


The candidate pair $p(x|w)$ and $\varphi(w)$ is prepared by a statistician without any information about the true distribution. If the modeling $(p(x|w), \varphi(w))$ is appropriate for the unknown true $q(x)$, then it is expected that $\hat{p}(x) \approx q(x)$. Otherwise, $\hat{p}(x) \not\approx q(x)$. Hence we need a method to evaluate the appropriateness of the modeling $(p(x|w), \varphi(w))$ without any information about $q(x)$. In this book, we show that such a method can be constructed based on mathematical laws which hold for arbitrary $(q(x), p(x|w), \varphi(w))$.

[Figure 1.1 diagram: True $q(x)$ → (Generate) → Sample $X_1, X_2, \ldots, X_n$ → (Estimation) → Posterior $p(w|X^n)$ → (Inference) → Predictive $p(x|X^n)$ → (Evaluation) → True $q(x)$]

Figure 1.1: Framework of Bayesian inference. The procedure of Bayesian estimation is shown. A sample $X^n$ is taken from the unknown true distribution $q(x)$. A statistician sets a statistical model and a prior, then the posterior density $p(w|X^n)$ is obtained. The true distribution $q(x)$ is estimated by a predictive density $p(x|X^n)$, whose accuracy is evaluated by using mathematical laws.


1.2 Probability Distribution


Let us introduce basic probability theory. For readers who need more background in mathematical probability theory, Chapter 10 may be of help.

Let $x = (x_1, x_2, \ldots, x_N)$ be a vector contained in the $N$-dimensional real Euclidean space $\mathbb{R}^N$. A real-valued function

$$q(x) = q(x_1, x_2, \ldots, x_N)$$

is said to be a probability density function if it satisfies

• For arbitrary $x \in \mathbb{R}^N$, $q(x) \ge 0$,

• $\int q(x)\,dx = \int\!\!\int \cdots \int q(x_1, x_2, \ldots, x_N)\,dx_1 dx_2 \cdots dx_N = 1$.

Let $A$ be a subset of $\mathbb{R}^N$ which has a finite integral value

$$\int_A q(x)\,dx.$$

A function $Q$ of a set $A$ defined by

$$Q(A) = \int_A q(x)\,dx$$

is called a probability distribution. Note that $Q(\mathbb{R}^N) = 1$ and $Q(\emptyset) = 0$, where $\emptyset$ is the empty set.

If the probability that a variable X is in a set A is equal to Q(A), then X 
is called a random variable and q(x) and Q are called the probability density 
function and the probability distribution of a random variable X, respec- 
tively. Also it is said that a random variable X is subject to a probability 
density q(x) or a probability distribution Q. 

Example 1. (1) Let $N = 1$. A probability density function of a uniform distribution on $[a_0, b_0]$ ($a_0 < b_0$) is given by

$$q(x) = \begin{cases} \dfrac{1}{b_0 - a_0} & (a_0 \le x \le b_0) \\ 0 & (\text{otherwise}) \end{cases}$$

(2) Let $S$ be an $N \times N$ positive definite matrix and $m \in \mathbb{R}^N$. A normal distribution which has an average $m$ and a covariance $S$ is defined by

$$q(x) = \frac{1}{C} \exp\left(-\frac{1}{2}\,(x - m,\ S^{-1}(x - m))\right),$$

where $(\ ,\ )$ is the inner product in $\mathbb{R}^N$ and

$$C = (2\pi)^{N/2} \sqrt{\det(S)}.$$

(3) Let $H(x)$ be a function of $x \in \mathbb{R}^N$ and $\beta > 0$. If

$$Z(\beta) = \int \exp(-\beta H(x))\,dx$$

is finite, then a probability density function

$$q(x) = \frac{1}{Z(\beta)} \exp(-\beta H(x))$$

is called an equilibrium state of a Hamilton function $H(x)$ with the inverse temperature $\beta$.
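As a small numerical check of Example 1 (3), the sketch below normalizes $\exp(-\beta H(x))$ for $N = 1$ by the trapezoidal rule. The quadratic Hamilton function, the truncated integration range, and the helper names are assumptions chosen for illustration; with $H(x) = x^2/2$ and $\beta = 1$ the equilibrium state reduces to the standard normal density of Example 1 (2), so $Z(1)$ should be close to $\sqrt{2\pi}$.

```python
import math

def integrate(f, lo, hi, steps=20000):
    """Composite trapezoidal rule for the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h

def equilibrium_state(H, beta, lo, hi):
    """Return (q, Z): q(x) = exp(-beta * H(x)) / Z, with Z computed on [lo, hi]."""
    Z = integrate(lambda x: math.exp(-beta * H(x)), lo, hi)
    return (lambda x: math.exp(-beta * H(x)) / Z), Z

H = lambda x: 0.5 * x * x  # assumed Hamilton function (quadratic)
q, Z = equilibrium_state(H, beta=1.0, lo=-10.0, hi=10.0)
print("Z(1) =", Z, " sqrt(2*pi) =", math.sqrt(2.0 * math.pi))
print("integral of q =", integrate(q, -10.0, 10.0))
```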


The delta function $\delta(x)$ is characterized by two conditions,

$$\delta(x) = \begin{cases} +\infty & (\text{if } x = 0) \\ 0 & (\text{if } x \ne 0) \end{cases}$$

and

$$\int \delta(x)\,dx = 1.$$

The delta function can be understood as the probability density function of a random variable $X = 0$. Its probability distribution is given by

$$Q(A) = \begin{cases} 1 & (\text{if } 0 \in A) \\ 0 & (\text{if } 0 \notin A) \end{cases}$$

If a random variable $X$ satisfies $X = 1$ and $X = 2$ with probabilities 1/3 and 2/3 respectively, then its probability density function is

$$q(x) = \frac{1}{3}\,\delta(x - 1) + \frac{2}{3}\,\delta(x - 2).$$

If $X$ is a random variable which is subject to $q(x)$ and $Q$, then $Y = f(X)$ is also a random variable which is subject to

$$p(y) = \int \delta(y - f(x))\,q(x)\,dx,$$

$$P(A) = \int_{f(x) \in A} q(x)\,dx.$$

These equations hold even if $f(x)$ is not one-to-one.
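The formula for $P(A)$ can be checked by simulation for a function that is not one-to-one. In the sketch below, the choice $f(x) = x^2$, the set $A = [0, 1]$, the standard normal $q(x)$, and the function name are all assumptions made for illustration; in this case the integral of $q$ over $\{x : f(x) \in A\} = [-1, 1]$ is available through the error function.

```python
import math
import random

def prob_f_in_A(f, in_A, draw_x, trials=200000):
    """Monte Carlo estimate of P(A), the integral of q(x) over {x : f(x) in A}."""
    hits = 0
    for _ in range(trials):
        if in_A(f(draw_x())):
            hits += 1
    return hits / trials

random.seed(1)
estimate = prob_f_in_A(
    f=lambda x: x * x,                      # not one-to-one
    in_A=lambda y: 0.0 <= y <= 1.0,         # A = [0, 1]
    draw_x=lambda: random.gauss(0.0, 1.0),  # X subject to the standard normal q(x)
)
exact = math.erf(1.0 / math.sqrt(2.0))      # integral of q over [-1, 1]
print("Monte Carlo:", estimate, " exact:", exact)
```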


Remark 2. The function 6(x) is not an ordinary function of x. However, 
it is mathematically well-defined by Schwartz distribution theory and Sato 
hyperfunction theory. In this book, the delta function is necessary to study 
posterior distributions which cannot be approximated by any normal distri- 
bution. 


Assume that a random variable $X$ is subject to a probability density $q(x)$. The expected value, the average, or the mean of a random variable $X$ on $\mathbb{R}^N$ is defined by

$$\mathbb{E}[X] = \int x\,q(x)\,dx,$$

if the right hand side is finite. The expected value of $Y = f(X)$ is

$$\mathbb{E}[Y] = \int y\,p(y)\,dy = \int\!\!\int y\,\delta(y - f(x))\,q(x)\,dx\,dy = \int f(x)\,q(x)\,dx.$$

The covariance matrix of $X$ is defined by

$$\mathbb{V}[X] = \mathbb{E}[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^T] = \mathbb{E}[XX^T] - \mathbb{E}[X]\,\mathbb{E}[X^T],$$

if the right hand side is finite, where $(\ )^T$ shows the transposed matrix. If $N = 1$, then $\mathbb{V}[X]$ and $\mathbb{V}[X]^{1/2}$ are called the variance and the standard deviation, respectively.


Let $(X, Y)$ be a pair of random variables which is subject to a probability density $q(x, y)$ on $\mathbb{R}^M \times \mathbb{R}^N$. Here $q(x, y)$ is called a simultaneous probability density of $(X, Y)$. Then $X$ and $Y$ are subject to the probability densities

$$q(x) = \int q(x, y)\,dy,$$

$$q(y) = \int q(x, y)\,dx,$$

where $q(x)$ and $q(y)$ are called marginal probability densities of $X$ and $Y$, respectively. The conditional probability density of $Y$ for a given $X$ is defined by

$$q(y|x) = \frac{q(x, y)}{q(x)}.$$

If $q(x) = 0$, then $q(y|x)$ is not defined; however, we define $0 \cdot q(y|x) = 0$. The conditional probability density function $q(x|y)$ is also defined by $q(x, y)/q(y)$. Then it follows that

$$q(x, y) = q(y|x)\,q(x) = q(x|y)\,q(y).$$

This equation is sometimes referred to as Bayes' theorem.


1.3 True Distribution


In this book, it is mainly assumed that a sample is a set of random variables 
taken from a true distribution. 

Let $n$ be a positive integer. A set of $\mathbb{R}^N$-valued random variables $X_1, X_2, \ldots, X_n$ is sometimes denoted by

$$X^n = (X_1, X_2, \ldots, X_n).$$

Throughout this book, the notation $n$ is used for the number of random variables. Sometimes $X^n$ and $n$ are referred to as a sample and a sample size, respectively. A realized value of $X^n$ in a trial is denoted by

$$x^n = (x_1, x_2, \ldots, x_n).$$

If $X^n$ is subject to a probability density function

$$q(x_1)\,q(x_2) \cdots q(x_n),$$

then $X^n$ is called a set of independent random variables which are subject to the same probability density $q(x)$. Here $q(x)$ is sometimes referred to as a true probability density. In practical applications, we do not know $q(x)$, but we assume that such a density $q(x)$ exists.

For an arbitrary function $f : x^n \mapsto f(x^n) \in \mathbb{R}$, the expected value of $f(X^n)$ over $X^n$ is denoted by $\mathbb{E}[\ ]$. That is to say,

$$\mathbb{E}[f(X^n)] = \int\!\!\int \cdots \int f(x^n) \prod_{i=1}^{n} q(x_i)\,dx_1 dx_2 \cdots dx_n.$$

The variance of $f(X^n)$ is denoted by

$$\mathbb{V}[f(X^n)] = \mathbb{E}[f(X^n)^2] - \mathbb{E}[f(X^n)]^2.$$

The average and empirical entropies of the true distribution are respectively defined by

$$S = -\int q(x) \log q(x)\,dx, \tag{1.1}$$

$$S_n = -\frac{1}{n} \sum_{i=1}^{n} \log q(X_i). \tag{1.2}$$

Then by the definition,

$$\mathbb{E}[S_n] = S, \tag{1.3}$$

$$\mathbb{V}[S_n] = \frac{1}{n}\left(\int q(x)(\log q(x))^2\,dx - S^2\right). \tag{1.4}$$
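Equations (1.3) and (1.4) can be checked by simulation. The sketch below assumes, purely for illustration, that the true density is the standard normal, for which $S = \frac{1}{2}\log(2\pi e)$ and the bracket in eq.(1.4) equals $1/2$; the sample size, the number of trials, and the function name are likewise hypothetical.

```python
import math
import random

def empirical_entropy(sample, log_q):
    """S_n = -(1/n) * sum_i log q(X_i), eq.(1.2)."""
    return -sum(log_q(x) for x in sample) / len(sample)

log_q = lambda x: -0.5 * math.log(2.0 * math.pi) - 0.5 * x * x  # assumed true density
S = 0.5 * math.log(2.0 * math.pi * math.e)                       # its entropy, eq.(1.1)

random.seed(2)
n, trials = 50, 4000
values = [
    empirical_entropy([random.gauss(0.0, 1.0) for _ in range(n)], log_q)
    for _ in range(trials)
]
mean_Sn = sum(values) / trials
var_Sn = sum((v - mean_Sn) ** 2 for v in values) / trials
print("E[S_n] ~", mean_Sn, " S =", S)
print("V[S_n] ~", var_Sn, " (1/n) * 0.5 =", 0.5 / n)
```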


Remark 3. (The number n) In statistics, $x_i$ and $x^n$ are referred to as a sample point and a sample, respectively. The number $n$ is called a sample size. In machine learning, $x_i$ and $x^n$ are referred to as a datum and a set of training data. The number $n$ is called the number of training data or the number of examples. In this book, the notation $n$ is used for the number of random variables, which is equal to the sample size in statistics and the number of training data in machine learning.

If a set of $\mathbb{R}^M \times \mathbb{R}^N$-valued random variables $(X^n, Y^n)$ is subject to a probability density function

$$q(x_1, y_1)\,q(x_2, y_2) \cdots q(x_n, y_n),$$

then $(X^n, Y^n)$ is a set of independent random variables. The average and empirical entropies are respectively given by

$$S = -\int\!\!\int q(x, y) \log q(x, y)\,dx\,dy, \tag{1.5}$$

$$S_n = -\frac{1}{n} \sum_{i=1}^{n} \log q(X_i, Y_i). \tag{1.6}$$

These are referred to as the simultaneous average and empirical entropies. We often need to estimate the conditional probability density function $q(y|x)$ under the condition that $(X^n, Y^n)$ is obtained. The average and the empirical entropies of the true conditional distribution are respectively defined by

$$S = -\int\!\!\int q(x, y) \log q(y|x)\,dx\,dy, \tag{1.7}$$

$$S_n = -\frac{1}{n} \sum_{i=1}^{n} \log q(Y_i|X_i). \tag{1.8}$$

Then by the definition,

$$\mathbb{E}[S_n] = S,$$

$$\mathbb{V}[S_n] = \frac{1}{n}\left(\int\!\!\int q(x, y)(\log q(y|x))^2\,dx\,dy - S^2\right).$$

Sometimes we need to study cases when $(X^n, Y^n)$ is not independent, but $Y^n$ for a given $X^n$ is independent. Such a case is explained in Sections 1.8 and 5.5.


1.4 Model, Prior, and Posterior 


Let $W$ be a set of parameters which is a subset of the $d$-dimensional real Euclidean space $\mathbb{R}^d$. A statistical model or a learning machine is defined by $p(x|w)$, which is a conditional probability density of $x \in \mathbb{R}^N$ for a given parameter $w \in W$. A prior $\varphi(w)$ is a probability density of $w \in W$.

Let $X^n = (X_1, X_2, \ldots, X_n)$ be a set of random variables which are independently subject to a probability density function $q(x)$. For an arbitrary pair $(p(x|w), \varphi(w))$, the posterior probability density is defined by

$$p(w|X^n) = \frac{1}{Z(X^n)}\,\varphi(w) \prod_{i=1}^{n} p(X_i|w),$$

where $Z(X^n)$ is defined by

$$Z(X^n) = \int \varphi(w) \prod_{i=1}^{n} p(X_i|w)\,dw,$$

which is called the partition function or the marginal likelihood. The posterior average or the expected value over the posterior distribution is denoted by $\mathbb{E}_w[\ ]$. For an arbitrary function $f(w)$,

$$\mathbb{E}_w[f(w)] = \int f(w)\,p(w|X^n)\,dw = \frac{1}{Z(X^n)} \int f(w)\,\varphi(w) \prod_{i=1}^{n} p(X_i|w)\,dw.$$

The posterior variance is also defined by

$$\mathbb{V}_w[f(w)] = \mathbb{E}_w[f(w)^2] - \mathbb{E}_w[f(w)]^2.$$

Note that the expectation operator $\mathbb{E}_w[\ ]$ depends on the sample $X^n$; hence $\mathbb{E}_w[f(w)]$ is not a constant but a random variable. The predictive density function is defined by

$$p(x|X^n) = \mathbb{E}_w[p(x|w)] = \int p(x|w)\,p(w|X^n)\,dw.$$

For a given sample $X^n$, the Bayesian estimation of the true distribution is defined by $p(x|X^n)$.


Remark 4. Sometimes a prior function which satisfies

$$\int \varphi(w)\,dw = \infty$$

is employed. If $\int \varphi(w)\,dw < \infty$, it is called proper, because it can be normalized so that $\int \varphi(w)\,dw = 1$. If $\varphi(w)$ is not proper, then it is called improper. Even for an improper prior, the posterior and predictive probability densities can be defined by the same equations if $Z(X^n)$ ($n \ge 1$) is finite and they are well-defined.
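A short worked example may make the remark concrete. The model and prior below are assumptions chosen for illustration and are not taken from the text: $p(x|w)$ is a normal density with mean $w$ and unit variance, and $\varphi(w) = 1$ on $\mathbb{R}$, which is improper. The partition function is nevertheless finite for every $n \ge 1$, so the posterior and predictive densities are well-defined.

```latex
% Assumed example: p(x|w) = N(x; w, 1) with the improper prior \varphi(w) = 1 on \mathbb{R}.
\begin{align*}
\sum_{i=1}^{n} (X_i - w)^2
  &= \sum_{i=1}^{n} (X_i - \bar{X})^2 + n\,(w - \bar{X})^2,
  \qquad \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \\
Z(X^n)
  &= \int \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-(X_i - w)^2/2}\, dw
   = (2\pi)^{-n/2} \sqrt{\frac{2\pi}{n}}\,
     \exp\Bigl(-\frac{1}{2}\sum_{i=1}^{n} (X_i - \bar{X})^2\Bigr) < \infty, \\
p(w|X^n)
  &= \sqrt{\frac{n}{2\pi}}\, \exp\Bigl(-\frac{n}{2}\,(w - \bar{X})^2\Bigr), \\
p(x|X^n)
  &= \frac{1}{\sqrt{2\pi (1 + 1/n)}}\,
     \exp\Bigl(-\frac{(x - \bar{X})^2}{2\,(1 + 1/n)}\Bigr).
\end{align*}
```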

Let $(X^n, Y^n)$ be a set of random variables which are independently subject to a probability density $q(x, y) = q(y|x)\,q(x)$. For an arbitrary pair $(p(y|x, w), \varphi(w))$, the posterior probability density is defined by

$$p(w|X^n, Y^n) = \frac{1}{Z(X^n, Y^n)}\,\varphi(w) \prod_{i=1}^{n} p(Y_i|X_i, w), \tag{1.9}$$

where $Z(X^n, Y^n)$ is defined by

$$Z(X^n, Y^n) = \int \varphi(w) \prod_{i=1}^{n} p(Y_i|X_i, w)\,dw.$$

The posterior average is also denoted by $\mathbb{E}_w[\ ]$. For an arbitrary function $f(w)$,

$$\mathbb{E}_w[f(w)] = \int f(w)\,p(w|X^n, Y^n)\,dw.$$

The posterior variance is also defined by

$$\mathbb{V}_w[f(w)] = \mathbb{E}_w[f(w)^2] - \mathbb{E}_w[f(w)]^2.$$

The predictive density function is defined by

$$p(y|x, X^n, Y^n) = \mathbb{E}_w[p(y|x, w)] = \int p(y|x, w)\,p(w|X^n, Y^n)\,dw. \tag{1.10}$$

For the case when $Y^n$ is independent for a given $X^n$, see Sections 1.8 and 5.5.


1.5 Examples of Posterior Distributions 


Let us illustrate several posterior distributions. In a simple statistical model, 
the posterior distribution can be approximated by a normal distribution, but 
not in a complex model. One of the main purposes of this book is to establish 
the universal mathematical theory which holds in both cases. 


Example 2. (Normal Distribution) A normal distribution whose average and standard deviation are $(a, \sigma)$ is defined by

$$p(x|a, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - a)^2}{2\sigma^2}\right). \tag{1.11}$$

Figure 1.2: Posterior distributions of a statistical model eq.(1.11) with n = 10 are shown. The white square is the true parameter. Even for an identical true probability density, posterior distributions fluctuate depending on samples.

Let us study a case when a prior is set as

$$\varphi(a, \sigma) = \begin{cases} 1/4 & (|a| < 1,\ 0 < \sigma < 2) \\ 0 & (\text{otherwise}) \end{cases}$$

Assume that a true distribution is $q(x) = p(x|0, 1)$ and the number of independent random variables is $n$. In this case, the parameter that attains the true density is unique,

$$q(x) = p(x|a, \sigma) \iff (a, \sigma) = (0, 1).$$

In Figure 1.2, posterior distributions for 12 different samples with n = 10 are shown using gray scale. The white square is the position of the true parameter (0, 1). Even if a true probability density function is identical, the posterior distribution fluctuates from sample to sample.

In Figure 1.3, posterior distributions for 12 different samples with n = 50 are shown. In this case, the posterior distributions concentrate in a neighborhood of the true parameter when n becomes large. It seems that the posterior distribution can be approximated by some normal distribution, hence we expect that a conventional statistical theory using the posterior normality can be applied to evaluation of statistical modeling.

Figure 1.3: Posterior distributions of a statistical model eq.(1.11) with n = 50. The white square is the true parameter. Posterior distributions can be approximated by some normal distribution.


Example 3. (Normal Mixture) One might think that a posterior distribution can be approximated by some normal distribution if $d/n$ is sufficiently small, where $n$ and $d$ are the number of random variables and the dimension of the parameter, respectively. However, such an expectation often fails even for ordinary statistical models. Let $\mathcal{N}(x)$ be the standard normal distribution,

$$\mathcal{N}(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right).$$

A normal mixture which has a parameter $(a, b)$ is defined by

$$p(x|a, b) = (1 - a)\,\mathcal{N}(x) + a\,\mathcal{N}(x - b), \tag{1.12}$$

where $0 \le a \le 1$ and $b \in \mathbb{R}$. Let us set a prior by

$$\varphi(a, b) = \begin{cases} 1 & (0 \le a, b \le 1) \\ 0 & (\text{otherwise}) \end{cases}$$

Assume that a true probability density is $q(x) = p(x|0.5, 0.3)$. Then

$$q(x) = p(x|a, b) \iff (a, b) = (0.5, 0.3),$$

by which one might expect that the posterior distribution will concentrate on the neighborhood of the true parameter (0.5, 0.3). The real posterior distributions for n = 100, n = 1000, and n = 10000 are shown in Figures 1.4, 1.5, and 1.6, respectively.


Figure 1.4: Posterior distributions of a statistical model eq.(1.12) with n = 100. The white square is the true parameter. The posterior distributions are far from any normal distribution and their fluctuations are very large. The statistical theory in this book enables us to estimate the generalization loss even in this case.

Even when n = 10000, the posterior distribution cannot be approximated by any normal distribution. The regular statistical theory, which assumes that the posterior distribution can be approximated by a normal distribution, cannot be applied to this case. Therefore, in order to use the regular asymptotic theory, the condition $n \gg 10000$ is necessary; otherwise it has been difficult to establish a statistical hypothesis test or a statistical model selection. In this book, we show that a new statistical theory which holds even if n = 100 can be established on a mathematical basis.

Figure 1.5: Posterior distributions of a statistical model eq.(1.12) with n = 1000. The white square is the true parameter. The number n = 1000 seems to be sufficiently large; however, the posterior distributions are far from any normal distribution and their fluctuations are very large.

Both statistical models given by eq.(1.11) and eq.(1.12) are employed in many statistical inferences. In the former model, $p(x|a, \sigma)$ represents one normal distribution for an arbitrary $(a, \sigma)$, whereas in the latter model, $p(x|a, b)$ represents one or two normal distributions, depending on the parameter. In fact, if $ab = 0$, then $p(x|a, b)$ is a standard normal distribution. Hence the parameter $(a, b)$ is not a simple parameter but affects the structure of a statistical model. In general, if a statistical model has hierarchical structure or a hidden variable such as the latter model, then the posterior distribution cannot be approximated by a normal distribution. In this book, in Chapter 4, we study the former statistical models, and in Chapters 5 and 6, we derive a new mathematical theory for both models.

Figure 1.6: Posterior distributions of a statistical model eq.(1.12) with n = 10000. The white square is the true parameter. In this case, n = 10000 and the number of parameters is 2; however, the posterior distributions cannot be approximated by any normal distribution.

From the mathematical point of view, the statistical model of eq.(1.11) does not have singularities, whereas that of eq.(1.12) does. The true parameter (0.5, 0.3) in eq.(1.12) is a nonsingular point but lies near the singularity (0, 0). In fact, the function from a parameter to a statistical model is not one-to-one,

$$\{(a, b);\ p(x|a, b) = p(x|0, 0)\} = \{(a, b);\ ab = 0\},$$

and $(a, b) = (0, 0)$ is a singularity of this set. It should be emphasized that a singularity affects the posterior distribution even if the true parameter is not a singularity. Moreover, several statistical models used in information processing such as artificial neural networks and mixture models have many singularities. In general, if a statistical model has singularities, then the Bayesian estimation has better generalization performance than the maximum likelihood method, hence we need new statistical theory for the purpose of constructing hypothesis testing, model selection, and hyperparameter optimization for such singular statistical models.
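The non-identifiability of the normal mixture of eq.(1.12) on the set $ab = 0$ can be seen directly in a few lines of code. The sketch below also evaluates the posterior on a grid of $[0, 1]^2$ under the uniform prior, so that readers can inspect how much posterior mass lies near the true parameter for one simulated sample. The grid resolution, the random seed, the 0.1 neighborhood, and the function names are assumptions made for illustration; no particular numerical outcome is claimed.

```python
import math
import random

def std_normal(x):
    """Standard normal density N(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def mixture(x, a, b):
    """Normal mixture of eq.(1.12): (1 - a) N(x) + a N(x - b)."""
    return (1.0 - a) * std_normal(x) + a * std_normal(x - b)

def draw_mixture(a, b):
    """One draw from the mixture: the shifted component is chosen with probability a."""
    shift = b if random.random() < a else 0.0
    return random.gauss(shift, 1.0)

def grid_posterior(sample, steps=40):
    """Posterior probabilities on a grid of [0, 1]^2 under the uniform prior."""
    axis = [k / steps for k in range(steps + 1)]
    log_post = {
        (a, b): sum(math.log(mixture(x, a, b)) for x in sample)
        for a in axis for b in axis
    }
    top = max(log_post.values())  # subtract the maximum for numerical stability
    unnorm = {ab: math.exp(lp - top) for ab, lp in log_post.items()}
    total = sum(unnorm.values())
    return {ab: u / total for ab, u in unnorm.items()}

# On the set ab = 0 the model collapses to the standard normal density.
print(mixture(1.5, 0.7, 0.0), mixture(1.5, 0.0, 0.9), std_normal(1.5))

random.seed(3)
sample = [draw_mixture(0.5, 0.3) for _ in range(100)]  # true parameter (0.5, 0.3)
post = grid_posterior(sample)
mass_near_truth = sum(
    p for (a, b), p in post.items() if abs(a - 0.5) <= 0.1 and abs(b - 0.3) <= 0.1
)
print("posterior mass within 0.1 of (0.5, 0.3):", mass_near_truth)
```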


1.6 Estimation and Generalization 


In order to evaluate how accurate the predictive density is, we need an 
objective measure which indicates the difference between the true and the 
estimated probability density. 


Definition 1. (Generalization and Training Losses) Let $X^n$ be a sample which is independently taken from a true distribution $q(x)$ and $p(x|X^n)$ be a predictive density using a statistical model $p(x|w)$ and a prior $\varphi(w)$. The training and generalization losses are respectively defined by

$$T_n = -\frac{1}{n} \sum_{i=1}^{n} \log p(X_i|X^n), \tag{1.13}$$

$$G_n = -\int q(x) \log p(x|X^n)\,dx. \tag{1.14}$$

Note that both $G_n$ and $T_n$ are random variables. Let $S$ be the entropy of a true distribution given by eq.(1.1). Then it immediately follows that

$$G_n - S = -\int q(x) \log p(x|X^n)\,dx + \int q(x) \log q(x)\,dx = \int q(x) \log \frac{q(x)}{p(x|X^n)}\,dx = K(q(x)\,\|\,p(x|X^n)), \tag{1.15}$$

where $K(q(x)\,\|\,p(x|X^n))$ is the Kullback-Leibler distance from $q(x)$ to $p(x|X^n)$. For the definition of Kullback-Leibler distance, see Section 10.2. In general,

(1) $K(q(x)\,\|\,p(x|X^n)) \ge 0$.

(2) $K(q(x)\,\|\,p(x|X^n)) = 0$ if and only if $q(x) = p(x|X^n)$.

Hence

(1) $G_n \ge S$.

(2) $G_n - S = 0$ if and only if $q(x) = p(x|X^n)$.

That is to say, the smaller $G_n$ is, the more precise the estimation is in the sense of the Kullback-Leibler distance. The random variables $G_n - S$ and $T_n - S_n$ are called generalization and training errors respectively.
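The two losses can be computed explicitly in a conjugate setting. The sketch below assumes a model $p(x|w)$ that is normal with mean $w$ and unit variance, a standard normal prior on $w$, and a true density that is normal with mean 0.5 and unit variance; under these assumptions the predictive density is normal and $G_n$ has a closed form. These choices and all function names are illustrative assumptions, not part of the text.

```python
import math
import random

def normal_logpdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def predictive_params(sample):
    """Predictive N(m, v) for the assumed model N(w, 1) with prior N(0, 1)."""
    n = len(sample)
    m = sum(sample) / (n + 1.0)
    return m, 1.0 + 1.0 / (n + 1.0)

def training_loss(sample):
    """T_n of eq.(1.13)."""
    m, v = predictive_params(sample)
    return -sum(normal_logpdf(x, m, v) for x in sample) / len(sample)

def generalization_loss(sample, true_mean):
    """G_n of eq.(1.14) for q = N(true_mean, 1), evaluated in closed form."""
    m, v = predictive_params(sample)
    return 0.5 * math.log(2.0 * math.pi * v) + (1.0 + (true_mean - m) ** 2) / (2.0 * v)

random.seed(4)
true_mean = 0.5
sample = [random.gauss(true_mean, 1.0) for _ in range(30)]
S = 0.5 * math.log(2.0 * math.pi * math.e)  # entropy of N(true_mean, 1)
T_n = training_loss(sample)
G_n = generalization_loss(sample, true_mean)
print("T_n =", T_n, " G_n =", G_n, " G_n - S =", G_n - S)
```

In a real application $G_n$ cannot be computed this way, because it requires the true density; the sketch only illustrates the definitions and the inequality $G_n \ge S$.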


Remark 5. Assume that we have two sets of statistical models and priors,

$$(p_1(x|w), \varphi_1(w)), \quad (p_2(x|w), \varphi_2(w)).$$

Let $p_1(x|X^n)$ and $p_2(x|X^n)$ be predictive densities of the two pairs respectively, and $G_n(1)$ and $G_n(2)$ be their generalization losses. Since the entropy $S$ does not depend on either a model or a prior,

$$G_n(1) > G_n(2) \iff K(q(x)\,\|\,p_1(x|X^n)) > K(q(x)\,\|\,p_2(x|X^n)),$$

which shows that the smaller generalization loss is equivalent to the smaller Kullback-Leibler distance. Two training losses $T_n(1)$ and $T_n(2)$ can be defined for both sets, but they do not have such properties. In other words, the smaller training loss does not mean a smaller generalization error.


Definition 2. Assume $n \ge 2$. Let $X^n \setminus X_i$ be the set of random variables $X_1, X_2, \ldots, X_n$ which does not contain $X_i$ and $p(x|X^n \setminus X_i)$ be the predictive density using $X^n \setminus X_i$. The cross validation loss is defined by

$$C_n = -\frac{1}{n} \sum_{i=1}^{n} \log p(X_i|X^n \setminus X_i). \tag{1.16}$$

Also $C_n - S_n$ is called a cross validation error.


Remark 6. The definition eq.(1.16) is called the leave-one-out cross validation loss. There are several kinds of cross validation losses; however, we mainly study the leave-one-out one in this book, because it is the most accurate as an estimator of the generalization loss. The cross validation loss can be defined even if $X^n$ is dependent. However, if $X^n$ is dependent, then it is not an appropriate estimator of the generalization loss. In fact, the following theorem needs independence.


Theorem 1. Assume that $X^n$ is independent. Then the following holds.
(1) Assume that the expectation values of $G_n$ and $C_n$ are finite. Then

$$\mathbb{E}[C_n] = \mathbb{E}[G_{n-1}]. \tag{1.17}$$

(2) The cross validation loss satisfies the relation,

$$C_n = \frac{1}{n} \sum_{i=1}^{n} \log \mathbb{E}_w\left[\frac{1}{p(X_i|w)}\right].$$

(3) For an arbitrary set of random variables $X^n$,

$$C_n \ge T_n.$$

The equality $C_n = T_n$ holds if and only if $p(X_i|w)$ ($i = 1, 2, \ldots, n$) is a constant function of $w$ on $\{w \in W;\ p(w|X^n) > 0\}$.
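
Proof. (1) The set $X^n \setminus X_i$ does not contain $X_i$, hence

$$\mathbb{E}[C_n] = -\mathbb{E}\left[\frac{1}{n} \sum_{i=1}^{n} \log p(X_i|X^n \setminus X_i)\right] = -\mathbb{E}\left[\frac{1}{n} \sum_{i=1}^{n} \int q(x) \log p(x|X^n \setminus X_i)\,dx\right] = \mathbb{E}[G_{n-1}].$$

(2) For an arbitrary $i$,

$$\prod_{j=1}^{n} p(X_j|w) = p(X_i|w) \prod_{j \ne i} p(X_j|w).$$

By the definition of the cross validation loss,

$$C_n = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{\int p(X_i|w)\,\varphi(w) \prod_{j \ne i} p(X_j|w)\,dw}{\int \varphi(w) \prod_{j \ne i} p(X_j|w)\,dw} = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{\int \varphi(w) \prod_{j=1}^{n} p(X_j|w)\,dw}{\int \frac{1}{p(X_i|w)}\,\varphi(w) \prod_{j=1}^{n} p(X_j|w)\,dw},$$

which shows (2) of the theorem.
(3) By using the result (2),

$$C_n - T_n = \frac{1}{n} \sum_{i=1}^{n} \log\left(\mathbb{E}_w[p(X_i|w)]\ \mathbb{E}_w[1/p(X_i|w)]\right).$$

From the Cauchy-Schwarz inequality, it follows that

$$\mathbb{E}_w[p(X_i|w)]\ \mathbb{E}_w[1/p(X_i|w)] \ge \mathbb{E}_w[p(X_i|w)^{1/2}\,p(X_i|w)^{-1/2}]^2 = 1.$$

The equality $C_n = T_n$ holds if and only if $p(X_i|w)^{1/2} \propto p(X_i|w)^{-1/2}$ as a function of $w$, which concludes (3). □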
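Parts (2) and (3) of Theorem 1 can be checked numerically. The sketch below computes the cross validation loss in two ways: from the full posterior only, using the relation in (2), and directly from definition (1.16) by refitting the posterior with each $X_i$ left out. The model (normal with mean $w$ and unit variance), the standard normal prior, the grid, and the function names are assumptions made for this illustration.

```python
import math
import random

def log_lik(x, w):
    """log p(x|w) for the assumed model N(w, 1)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - w) ** 2

def posterior_weights(sample, grid):
    """Grid posterior for the assumed prior N(0, 1); the weights sum to 1."""
    log_post = [-0.5 * w * w + sum(log_lik(x, w) for x in sample) for w in grid]
    top = max(log_post)
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def post_avg(g, grid, weights):
    """Posterior average of g(w) on the grid."""
    return sum(wt * g(w) for w, wt in zip(grid, weights))

random.seed(5)
sample = [random.gauss(0.0, 1.0) for _ in range(10)]
n = len(sample)
grid = [-4.0 + 0.002 * k for k in range(4001)]  # w in [-4, 4]
weights = posterior_weights(sample, grid)

# Theorem 1 (2): C_n from the full posterior only.
C_formula = sum(
    math.log(post_avg(lambda w: math.exp(-log_lik(x, w)), grid, weights))
    for x in sample
) / n

# Definition (1.16): refit the posterior n times, each time leaving X_i out.
C_direct = 0.0
for i, x in enumerate(sample):
    loo = posterior_weights(sample[:i] + sample[i + 1:], grid)
    C_direct -= math.log(post_avg(lambda w: math.exp(log_lik(x, w)), grid, loo)) / n

# Training loss (1.13), for the inequality in Theorem 1 (3).
T_n = -sum(
    math.log(post_avg(lambda w: math.exp(log_lik(x, w)), grid, weights))
    for x in sample
) / n
print("C_n (formula):", C_formula, " C_n (direct):", C_direct, " T_n:", T_n)
```

The practical point of (2) is visible in the code: the first computation needs only one posterior, whereas the direct definition needs $n$ of them.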


Remark 7. (1) The conditional probability density $p(X_i|X^n \setminus X_i)$ is the predictive density of $X_i$ based on a sample $X^n$ leaving $X_i$ out. Thus the average cross validation loss is naturally an unbiased estimator of the generalization loss for $n - 1$. In the real world, the generalization loss cannot be calculated because we do not know the true distribution $q(x)$, whereas the cross validation loss can be obtained using only a sample $X^n$. There are two issues about the cross validation.

• Although the averages of $C_n$ and $G_{n-1}$ are equal to each other, their variances are not directly derived from their definitions. In the following chapters, we prove that the standard deviations of $G_n - S$ and $C_n - S_n$ are asymptotically equal to each other, in proportion to $1/n$.

• If the average by the posterior distribution is numerically approximated, then

$$\mathrm{ISCV} = \frac{1}{n} \sum_{i=1}^{n} \log \mathbb{E}_w\left[\frac{1}{p(X_i|w)}\right]$$

is called the importance sampling cross validation loss. Note that the approximated cross validation loss

$$\mathrm{CV} = -\frac{1}{n} \sum_{i=1}^{n} \log \mathbb{E}_w^{(-i)}[p(X_i|w)],$$

where $\mathbb{E}_w^{(-i)}[\ ]$ shows the posterior average for $X^n \setminus X_i$, is different from the importance sampling cross validation loss if the posterior density is not precisely approximated.
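The importance sampling cross validation loss is usually computed from posterior draws. In the sketch below, the draws are generated directly from a conjugate normal posterior as a stand-in for the output of a sampler such as MCMC; the model (normal with mean $w$ and unit variance), the standard normal prior, the number of draws, and the function names are all assumptions made for illustration. The closed-form leave-one-out loss, which is available only because of the conjugate choice, is printed for comparison.

```python
import math
import random

def log_lik(x, w):
    """log p(x|w) for the assumed model N(w, 1)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - w) ** 2

def iscv(sample, draws):
    """ISCV = (1/n) * sum_i log( mean over draws of 1 / p(X_i | w_k) )."""
    total = 0.0
    for x in sample:
        inv_mean = sum(math.exp(-log_lik(x, w)) for w in draws) / len(draws)
        total += math.log(inv_mean)
    return total / len(sample)

def exact_loo(sample):
    """Closed-form leave-one-out loss for this conjugate model, for comparison."""
    n, sx = len(sample), sum(sample)
    var = 1.0 + 1.0 / n  # predictive variance when one point is left out
    return sum(
        0.5 * math.log(2.0 * math.pi * var) + (x - (sx - x) / n) ** 2 / (2.0 * var)
        for x in sample
    ) / n

random.seed(6)
sample = [random.gauss(0.0, 1.0) for _ in range(20)]
n = len(sample)

# Conjugate posterior for the assumed prior N(0, 1): w | X^n ~ N(m, s2).
m = sum(sample) / (n + 1.0)
s2 = 1.0 / (n + 1.0)
draws = [random.gauss(m, math.sqrt(s2)) for _ in range(20000)]

print("ISCV:", iscv(sample, draws), " exact C_n:", exact_loo(sample))
```

The average of $1/p(X_i|w)$ can have a large variance when a sample point lies far from the bulk of the posterior, so the agreement seen in an easy case like this one should not be taken for granted in general.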


Definition 3. Assume $n \ge 1$. Let $X^n$ be a set of random variables. The widely applicable information criterion (WAIC) is defined by

$$W_n = T_n + \frac{1}{n} \sum_{i=1}^{n} \mathbb{V}_w[\log p(X_i|w)], \tag{1.18}$$

where $\mathbb{V}_w[\ ]$ shows the posterior variance. Also $W_n - S_n$ is called a WAIC error.

In the following chapters, we show that, if $X^n$ is independent, then WAIC is asymptotically equivalent to the cross validation loss,

$$W_n = C_n + O_p(1/n^2), \tag{1.19}$$

and

$$\mathbb{E}[W_n] = \mathbb{E}[C_n] + O(1/n^2). \tag{1.20}$$

Moreover, there are several cases in which, even if $X^n$ is dependent,

$$\mathbb{E}[W_n] = \mathbb{E}[G_n] + o(1/n). \tag{1.21}$$

For example, the formula $\mathbb{E}[C_n] = \mathbb{E}[G_{n-1}]$ does not hold in conditional independent cases such as regression problems with fixed inputs or time series prediction, whereas $\mathbb{E}[W_n] = \mathbb{E}[G_n] + o(1/n)$ holds even for such cases.
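Definition (1.18) translates directly into code once posterior averages are available. The following sketch evaluates $T_n$, the posterior variances of $\log p(X_i|w)$, $W_n$, and, for comparison, $C_n$ on a parameter grid. The model (normal with mean $w$ and unit variance), the standard normal prior, the grid, and the names are assumptions chosen for illustration.

```python
import math
import random

def log_lik(x, w):
    """log p(x|w) for the assumed model N(w, 1)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - w) ** 2

def posterior_weights(sample, grid):
    """Grid posterior for the assumed prior N(0, 1); the weights sum to 1."""
    log_post = [-0.5 * w * w + sum(log_lik(x, w) for x in sample) for w in grid]
    top = max(log_post)
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

random.seed(7)
sample = [random.gauss(0.0, 1.0) for _ in range(20)]
n = len(sample)
grid = [-4.0 + 0.002 * k for k in range(4001)]  # w in [-4, 4]
weights = posterior_weights(sample, grid)

T_n = 0.0  # training loss (1.13)
C_n = 0.0  # cross validation loss via Theorem 1 (2)
V_n = 0.0  # sum over i of the posterior variance of log p(X_i|w)
for x in sample:
    ll = [log_lik(x, w) for w in grid]
    mean_p = sum(wt * math.exp(l) for l, wt in zip(ll, weights))
    mean_inv_p = sum(wt * math.exp(-l) for l, wt in zip(ll, weights))
    mean_l = sum(wt * l for l, wt in zip(ll, weights))
    mean_l2 = sum(wt * l * l for l, wt in zip(ll, weights))
    T_n -= math.log(mean_p) / n
    C_n += math.log(mean_inv_p) / n
    V_n += mean_l2 - mean_l ** 2

W_n = T_n + V_n / n  # WAIC, eq.(1.18)
print("T_n =", T_n, " W_n =", W_n, " C_n =", C_n)
```

With posterior draws instead of a grid, the same quantities are obtained by replacing the weighted sums with plain averages over the draws.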

Remark 8. The cross validation loss and WAIC can be employed for evaluation of a statistical model and a prior even if a prior is improper.


Remark 9. (Loss and error) The generalization, cross validation, and WAIC errors are defined by

$$G_n - S, \quad C_n - S_n, \quad W_n - S_n,$$

where $S$ and $S_n$ are the average and empirical entropies of a true distribution, respectively. In practical applications, we do not know the true distribution, so $S$ and $S_n$ are unknown. However, neither $S$ nor $S_n$ depends on a statistical model and a prior. In model selection and hyperparameter optimization, minimizing losses is equivalent to minimizing errors. On the other hand, errors have smaller variances than losses, hence in numerical experiments, we often compare errors instead of losses.


1.7 Marginal Likelihood or Partition Function 


If a prior $\varphi(w)$ satisfies $\int \varphi(w)\,dw = 1$, then the marginal likelihood or the
partition function $Z(X^n) = Z(X_1, X_2, \ldots, X_n)$ satisfies

\[ \int Z(x_1, \ldots, x_n)\, dx_1 dx_2 \cdots dx_n
   = \int dw\, \varphi(w) \int dx_1 dx_2 \cdots dx_n \prod_{i=1}^{n} p(x_i|w) = 1. \]

Therefore $Z(x^n)$ can be understood as an estimated probability density function
of $X^n$ by using a statistical model $p(x|w)$ and a prior $\varphi(w)$. For this reason
$Z(x^n)$ is sometimes written as $p(x^n)$.
The free energy or the minus log marginal likelihood is defined by

\[ F_n = -\log Z(X^n). \tag{1.22} \]


Then by using notations $q(x^n) = \prod_{i=1}^{n} q(x_i)$ and $p(x^n) = Z(x^n)$,

\[ E[F_n] - nS = \int q(x^n) \log \frac{q(x^n)}{p(x^n)}\, dx^n, \]

which shows that $E[F_n] - nS$ is equal to the Kullback-Leibler distance from
the true density $q(x^n)$ to the estimated density $p(x^n)$. A smaller $E[F_n]$ is
equivalent to a smaller Kullback-Leibler distance between them. Note that
$E[G_n] - S$ is the average Kullback-Leibler distance from $q(x)$ to $p(x|X^n)$,
whereas $E[F_n] - nS$ is their sum.

Theorem 2. Let $n \geq 1$. The average generalization loss is equal to the
increase of the free energy,

\[ E[G_n] = E[F_{n+1}] - E[F_n]. \tag{1.23} \]

Therefore the average free energy is the sum of the average generalization losses,

\[ E[F_n] = \sum_{i=1}^{n-1} E[G_i] + E[F_1]. \tag{1.24} \]

Proof. Let $X_{n+1}$ be a random variable which is independent of $X^n$ and
subject to the same probability density function $q(x)$. Then for an arbitrary
function $f(x)$,

\[ \int q(x) f(x)\, dx = E_{X_{n+1}}[f(X_{n+1})]. \]

By using this notation,

\[ \begin{aligned}
G_n &= -\int q(x) \log p(x|X^n)\, dx \\
    &= -E_{X_{n+1}}[\log p(X_{n+1}|X^n)] \\
    &= -E_{X_{n+1}}\left[ \log
       \frac{\int \varphi(w) \prod_{i=1}^{n+1} p(X_i|w)\, dw}
            {\int \varphi(w) \prod_{i=1}^{n} p(X_i|w)\, dw} \right] \\
    &= -E_{X_{n+1}}[\log Z(X^{n+1})] + \log Z(X^n).
\end{aligned} \]

The expected values over $X^n$ of this equation show eq.(1.23). Therefore,

\[ \begin{aligned}
E[F_n] &= E[G_{n-1}] + E[F_{n-1}] \\
       &= E[G_{n-1}] + E[G_{n-2}] + E[F_{n-2}] \\
       &= \sum_{i=1}^{n-1} E[G_i] + E[F_1],
\end{aligned} \]

which shows eq.(1.24). $\Box$
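Theorem 2 can be checked exactly in a small case. The sketch below is an illustration only, under assumptions not made in the text at this point: a coin toss model with $P(X=1) = a_0$ and a uniform prior on the parameter, for which $Z(X^n) = n_1!\,n_2!/(n+1)!$ and the predictive probability of a one is $(n_1+1)/(n+2)$. The function names are hypothetical.

```python
import math

def binom_pmf(k, n, a0):
    """Probability that n tosses with P(X=1)=a0 contain exactly k ones."""
    return math.comb(n, k) * a0**k * (1.0 - a0)**(n - k)

def mean_free_energy(n, a0):
    """E[F_n] with F_n = -log Z(X^n), Z = n1! n2! / (n+1)! (uniform prior assumed)."""
    total = 0.0
    for k in range(n + 1):
        log_z = math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)
        total -= binom_pmf(k, n, a0) * log_z
    return total

def mean_generalization_loss(n, a0):
    """E[G_n], using the predictive probability (n1+1)/(n+2) of observing a one."""
    total = 0.0
    for k in range(n + 1):
        p1 = (k + 1) / (n + 2)
        g = -a0 * math.log(p1) - (1.0 - a0) * math.log(1.0 - p1)
        total += binom_pmf(k, n, a0) * g
    return total

if __name__ == "__main__":
    a0 = 0.3
    for n in (1, 5, 20):
        lhs = mean_generalization_loss(n, a0)
        rhs = mean_free_energy(n + 1, a0) - mean_free_energy(n, a0)
        print(n, lhs, rhs)  # the two columns should agree, as in eq.(1.23)
```

Because the expectations are finite sums over the number of ones, the two sides of eq.(1.23) agree up to floating point rounding rather than up to Monte Carlo error.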


Remark 10. (Marginal likelihood and free energy) By the definition $F_n =
-\log Z(X^n)$, the correspondence between the free energy and the marginal
likelihood is one-to-one. Hence, if one of them is obtained, the other can
be easily derived. However, in general, the asymptotic order of the marginal
likelihood as a random variable is not equal to that of its average, whereas that of
the free energy is equal to that of its average. Therefore, in studying asymptotic
statistics, the free energy is a more convenient random variable than the
marginal likelihood. Let us illustrate this fact. A marginal likelihood ratio
function $r(x^n)$ is defined by

\[ r(x^n) = \frac{p(x^n)}{q(x^n)}. \]

Let $X^n$ be a set of random variables which are independently subject to
$q(x)$. Then

\[ E[r(X^n)] = \int r(x^n)\, q(x^n)\, dx^n = 1. \]

Therefore the average of $r(X^n)$ is always equal to one. However,
$r(X^n) \to 0$ in probability.


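For example, for $x, a \in \mathbb{R}$, a statistical model and a prior are defined by

\[ p(x|a) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{(x-a)^2}{2}\Big), \qquad
   \varphi(a) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{a^2}{2}\Big), \]

and a true distribution is set as $q(x) = p(x|0)$, then

\[ r(X^n) = \frac{1}{\sqrt{n+1}} \exp\Big( \frac{1}{2(n+1)} \Big(\sum_{i=1}^{n} X_i\Big)^2 \Big). \]

Since $(1/\sqrt{n}) \sum_{i=1}^{n} X_i$ is a random variable which is subject to the standard
normal distribution, $r(X^n) \to 0$ in probability. Therefore the order of $r(X^n)$
is not equal to that of its average. On the other hand, the order of the random
variable $-\log r(X^n)$ is equal to that of its average, because

\[ -\log r(X^n) = \frac{1}{2}\log(n+1)
   - \frac{n}{2(n+1)} \Big( \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i \Big)^2. \]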
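The contrast between the average and the typical size of the ratio can be made concrete. The following sketch assumes the standard normal model and standard normal prior with true parameter 0, for which $r(X^n) = (n+1)^{-1/2} \exp(c z^2)$ with $c = n/(2(n+1))$ and $z = (1/\sqrt{n})\sum_i X_i$ standard normal. The function names `prob_r_exceeds` and `mean_r` are hypothetical.

```python
import math

def prob_r_exceeds(eps, n):
    """P(r(X^n) > eps), where r = (n+1)^(-1/2) * exp(c z^2) and z ~ N(0,1)."""
    # r > eps  <=>  z^2 > t, with the threshold t computed below
    t = (2.0 * (n + 1) / n) * (math.log(eps) + 0.5 * math.log(n + 1))
    if t <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(t / 2.0))   # P(z^2 > t) for standard normal z

def mean_r(n, steps=200000):
    """E[r(X^n)] by trapezoidal integration over z."""
    c = n / (2.0 * (n + 1))
    half_width = 12.0 * math.sqrt(n + 1)   # integrand has standard deviation sqrt(n+1)
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        z = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.exp((c - 0.5) * z * z)
    return total * h / math.sqrt(2.0 * math.pi * (n + 1))

if __name__ == "__main__":
    for n in (10, 1000, 10**6):
        print(n, mean_r(n), prob_r_exceeds(0.5, n))
```

The mean stays at one for every $n$ while the probability that $r(X^n)$ exceeds any fixed level shrinks, which is the behavior described in the remark.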


Remark 11. (Simultaneous prediction) Let us compare Bayesian estimation
and the other estimation from the simultaneous prediction point of view.
Let $X_1, X_2, \ldots, X_n, X_{n+1}, \ldots, X_{n+m}$ be independent random variables which
are subject to the same distribution. The simultaneous estimation of

\[ X^{n+m} \setminus X^n = (X_{n+1}, X_{n+2}, \ldots, X_{n+m}) \]

for a given sample $X^n$ is

\[ p(x^{n+m} \setminus x^n \,|\, x^n) = \frac{Z(x^{n+m})}{Z(x^n)}. \]

Hence

\[ p(x^{n+m} \setminus x^n \,|\, x^n) = \prod_{j=1}^{m} \frac{Z(x^{n+j})}{Z(x^{n+j-1})}, \]

resulting that

\[ -\log p(X^{n+m} \setminus X^n \,|\, X^n)
   = -\sum_{j=1}^{m} \log \frac{Z(X^{n+j})}{Z(X^{n+j-1})}. \]

The average of this equation is given by

\[ E[G_n] + E[G_{n+1}] + \cdots + E[G_{n+m-1}]. \tag{1.25} \]

On the other hand, let $\hat{w}$ be an estimator such as the maximum likelihood
or the maximum a posteriori method determined by $X^n$. Then the generalization
loss of $X^{n+m} \setminus X^n$ for a given $X^n$ is

\[ -\sum_{j=1}^{m} \log p(X_{n+j}|\hat{w}), \]

whose expected value is

\[ -m \times E[\log p(X|\hat{w})]. \tag{1.26} \]

By eq.(1.25), it is shown that, in Bayesian estimation, the predicted sample
point is automatically used, recursively. However, by eq.(1.26), in other
methods, that is not the case. In ordinary cases, the average generalization
loss $E[G_n]$ is a decreasing function of $n$, hence, from the simultaneous
prediction point of view, Bayesian estimation is better than the other methods.
For the same reason, in the prediction of high dimensional $X$, it is expected
that Bayesian estimation has better performance than the other
methods.


Remark 12. (Predictive measure and marginal likelihood) The cross validation
loss measures the predictive loss, which is defined by the Kullback-Leibler
distance between $q(x)$ and $p(x|X^n)$, whereas the free energy indicates the
cumulative loss, which is defined by the Kullback-Leibler distance between $q(x^n)$
and $p(x^n)$. Both measures are important but different in Bayesian statistics.
If they are used as criteria for choosing the best model or the best
hyperparameter, then the chosen model or hyperparameter may differ
depending on the criterion. In Chapter 8, we study mathematical properties of
both measures.

Remark 13. (Meaning of the marginal likelihood) Assume that $P_0(p, \varphi)$ is
the prior distribution of a model $p(x|w)$ and a prior $\varphi(w)$. Then the
probability density of $X^n$ for a given $(p, \varphi)$ is

\[ P(X^n|p, \varphi) = \int \prod_{i=1}^{n} p(X_i|w)\, \varphi(w)\, dw = Z(X^n). \]

By Bayes' theorem, the posterior probability density of $(p, \varphi)$ for a given
sample $X^n$ is

\[ P(p, \varphi|X^n) = \frac{P(X^n|p, \varphi)\, P_0(p, \varphi)}{P(X^n)}. \]

Although the maximization of $Z(X^n)$ is not strictly equivalent to the maximization
of the posterior probability $P(p, \varphi|X^n)$, if $n$ is sufficiently large, the maximization of
$Z(X^n)$ becomes equivalent to the maximization of the posterior probability.


Remark 14. (Asymptotic expansions of free energy and generalization loss)
Let $f(n) = E[F_n]$ and $g(n) = E[G_n]$. Assume that there exist constants
$\{A_i\}$ and $\{B_i\}$ such that asymptotic expansions

\[ f(n) = A_1 n + A_2 \sqrt{n} + A_3 \log n + A_4 \log\log n + O(1), \tag{1.27} \]

\[ g(n) = B_1 + \frac{B_2}{2\sqrt{n}} + \frac{B_3}{n} + \frac{B_4}{n \log n}
   + o\Big(\frac{1}{n \log n}\Big) \tag{1.28} \]

hold for $n \to \infty$. Then by eq.(1.23), $A_i = B_i$ $(i = 1, 2, 3, 4)$. It is important
that the constant order term of the free energy $f(n)$ does not affect the
generalization loss $g(n)$. Sometimes minimization of the free energy changes
the constant order term but does not minimize the generalization loss. See
Chapter 8. Also note that, mathematically speaking, even if $f(n)$ has an
asymptotic expansion, $g(n)$ may not have any asymptotic expansion. However,
if $g(n)$ has an asymptotic expansion, then its coefficients are uniquely
determined by the asymptotic expansion of $f(n)$.


1.8 Conditional Independent Cases 


In several practical applications, we need to study cases when $X^n$ is
dependent. In this section, let us assume that $X^n$ is dependent but $Y^n$ is
conditionally independent, in other words, $Y^n$ is independent for a given
X”. If a set of R‘-valued random variables Y" = (Y1, Y2,..., Yn) is subject 
to a probability density function, 


\[ q(y_1|x_1)\, q(y_2|x_2) \cdots q(y_n|x_n) \]

for some fixed $x^n = (x_1, x_2, \ldots, x_n)$, then $Y^n$ is called a set of conditionally
independent random variables subject to a conditional probability density function
$\prod_{i=1}^{n} q(y_i|x_i)$. For an arbitrary function $f : (x^n, y^n) \mapsto f(x^n, y^n) \in \mathbb{R}$, the
expected value of $f(x^n, Y^n)$ over $Y^n$ is denoted by $E[\ ]$. That is to say,

\[ E[f(x^n, Y^n)] = \int \cdots \int f(x^n, y^n) \prod_{i=1}^{n} q(y_i|x_i)\, dy_i, \]

which is a function of $x^n$. The average and the empirical entropies of the
true distribution are respectively defined by

\[ S = -\frac{1}{n} \sum_{i=1}^{n} \int q(y|x_i) \log q(y|x_i)\, dy, \tag{1.29} \]

\[ S_n = -\frac{1}{n} \sum_{i=1}^{n} \log q(Y_i|x_i). \tag{1.30} \]

Note that both posterior and predictive probability densities are given by
the same equations as eq.(1.9) and eq.(1.10), respectively,

\[ p(w|x^n, Y^n) = \frac{1}{Z(x^n, Y^n)}\, \varphi(w) \prod_{i=1}^{n} p(Y_i|x_i, w), \tag{1.31} \]

\[ p(y|x, x^n, Y^n) = \int p(y|x, w)\, p(w|x^n, Y^n)\, dw. \tag{1.32} \]

In conditionally independent cases, the generalization and training losses are
defined by using the given $x^n$, because the expected value over $x$ is not defined,

\[ G_n = -\frac{1}{n} \sum_{i=1}^{n} \int dy\, q(y|x_i) \log p(y|x_i, x^n, Y^n), \]

\[ T_n = -\frac{1}{n} \sum_{i=1}^{n} \log p(Y_i|x_i, x^n, Y^n). \]

Also the generalization and training errors are defined by

\[ G_n - S = \frac{1}{n} \sum_{i=1}^{n} \int q(y|x_i) \log \frac{q(y|x_i)}{p(y|x_i, x^n, Y^n)}\, dy, \]

\[ T_n - S_n = \frac{1}{n} \sum_{i=1}^{n} \log \frac{q(Y_i|x_i)}{p(Y_i|x_i, x^n, Y^n)}. \]

Both the cross validation loss and WAIC can be defined by the same forms
as the independent case,

\[ C_n = \frac{1}{n} \sum_{i=1}^{n} \log E_w[1/p(Y_i|x_i, w)], \]

\[ W_n = T_n + \frac{1}{n} \sum_{i=1}^{n} V_w[\log p(Y_i|x_i, w)]. \]

However, in this case,

\[ E[C_n] \neq E[G_{n-1}], \]

whereas, in Section 5.5, we show

\[ E[W_n] = E[G_n] + o(1/n). \]

Hence, even if $x^n$ is dependent, if $C_n$ is asymptotically equivalent to $W_n$,

\[ E[C_n] = E[G_n] + o(1/n) \]

also holds, whereas otherwise the cross validation loss cannot be
applied to estimating the generalization loss.


Example 4. (1) In some applications, regression problems of $\{Y_i\}$ for a given
fixed set $\{x_i; i = 1, 2, \ldots, n\}$ are studied. In such cases the cross validation loss cannot
be employed.

(2) A time series prediction problem,

\[ Z_t = a_1 Z_{t-1} + a_2 Z_{t-2} + a_3 Z_{t-3} + \text{Gaussian noise}, \]

can be understood as a regression problem,

\[ (Z_{t-1}, Z_{t-2}, Z_{t-3}) = x_t \mapsto Y_t = Z_t. \]

Therefore $x_t$ is dependent, resulting that the cross validation loss cannot be
employed, whereas WAIC can be.


1.9 Problems


1. Let $w_0 = (1, 1, 1, \ldots, 1) \in \mathbb{R}^{10}$ and $W$ be a random variable on $\mathbb{R}^{10}$ which
is subject to

\[ p(w) = C\{\exp(-\|w\|^2) + 100 \exp(-10\|w - w_0\|^2)\}, \]

where $\|w\|^2 = \sum_i w_i^2$ and $C$ is a constant. Show the following:

(1) Let $\hat{w}$ be the maximum point of $p(w)$. Then $\hat{w} \approx w_0$.

(2) $E[W] \approx 0$.

Hence $E[W]$ is far from $\hat{w}$. Assume that $p(w)$ is a posterior distribution
of some statistical model. Discuss the difference between the maximum
likelihood or a posteriori estimator and Bayesian estimation.
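A quick numerical look at Problem 1 may help before attempting the proof. The sketch below is only an illustration: it uses the Gaussian integrals $\int \exp(-\|w\|^2)dw = \pi^{5}$ and $\int \exp(-10\|w\|^2)dw = (\pi/10)^{5}$ in ten dimensions to obtain the mass of each bump. The names `density_over_c` and `mean_vector` are hypothetical.

```python
import math

DIM = 10
W0 = [1.0] * DIM

def density_over_c(w):
    """p(w)/C = exp(-||w||^2) + 100 exp(-10 ||w - w0||^2)."""
    sq = sum(v * v for v in w)
    sq0 = sum((v - u) ** 2 for v, u in zip(w, W0))
    return math.exp(-sq) + 100.0 * math.exp(-10.0 * sq0)

def mean_vector():
    """E[W], obtained from the probability mass carried by each Gaussian bump."""
    mass_origin = math.pi ** (DIM / 2)                # integral of exp(-||w||^2)
    mass_w0 = 100.0 * (math.pi / 10.0) ** (DIM / 2)   # integral of the second term
    weight_w0 = mass_w0 / (mass_origin + mass_w0)
    return [weight_w0 * u for u in W0]

if __name__ == "__main__":
    print(density_over_c(W0), density_over_c([0.0] * DIM))  # tall narrow peak at w0
    print(mean_vector()[0])                                  # mean stays near the origin
```

The peak near $w_0$ is about a hundred times higher than the density at the origin, yet it carries only about 0.1% of the probability mass, so the posterior mean remains close to the origin.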


2. (Fluctuation-dissipation theorem) Let $\beta > 0$ and $H(x)$ be a function
from $\mathbb{R}^N$ to $\mathbb{R}$. Assume that a random variable $X \in \mathbb{R}^N$ is subject to a
probability density function,

\[ p(x|\beta) = \frac{1}{Z(\beta)} \exp(-\beta H(x)), \]

where $Z(\beta)$ is a constant

\[ Z(\beta) = \int \exp(-\beta H(x))\, dx. \]

Then prove the equation,

\[ \frac{\partial E[H(X)]}{\partial \beta} = -V[H(X)]. \]

It is well known that this equation demonstrates several important laws in
physics.


3. (Coin toss) Let $p(x|a)$ be a statistical model of $x \in \{0, 1\}$ which is defined
by

\[ p(x|a) = a^{x} (1-a)^{1-x}, \]

where $a$ is a parameter $(0 \leq a \leq 1)$. See Figure 1.7. Let us study a prior
$\varphi(a)$ which is defined by

\[ \varphi(a) = 1 \]

on $0 \leq a \leq 1$. Let $X^n$ be a set of random variables which are independently
subject to $p(x|a_0)$. Also let $n_1$ and $n_2$ be


\[ n_1 = \sum_{i=1}^{n} X_i, \qquad n_2 = n - n_1. \]

Then show the following equations.
(1) The maximum likelihood estimator is $\hat{a} = n_1/n$ and the estimated
probability distribution $p(x|\hat{a})$ by the maximum likelihood method is given by

\[ p(1|\hat{a}) = \frac{n_1 + \epsilon}{n + 2\epsilon}, \qquad
   p(0|\hat{a}) = \frac{n_2 + \epsilon}{n + 2\epsilon}, \]

where $\epsilon = 0$.

(2) The Bayesian predictive distribution $p(x|X^n)$ is

\[ p(1|X^n) = \frac{n_1 + 1}{n + 2}, \qquad p(0|X^n) = \frac{n_2 + 1}{n + 2}. \]

(3) The generalization error of the maximum likelihood method is defined
by the Kullback-Leibler information from $p(x|a_0)$ to $p(x|\hat{a})$,

\[ \begin{aligned}
K_{ML} = &-a_0 \log((n_1 + \epsilon)/(n + 2\epsilon)) - (1 - a_0) \log((n_2 + \epsilon)/(n + 2\epsilon)) \\
         &+ a_0 \log a_0 + (1 - a_0) \log(1 - a_0),
\end{aligned} \]

where $\epsilon > 0$ is a sufficiently small positive value. Note that, if $\epsilon = 0$,
$0 < a_0 < 1$, and $n_1 n_2 = 0$, then $K_{ML} = \infty$. That of the Bayesian method is
defined by the Kullback-Leibler distance from $p(x|a_0)$ to $p(x|X^n)$,

\[ \begin{aligned}
K_{Bayes} = &-a_0 \log((n_1 + 1)/(n + 2)) - (1 - a_0) \log((n_2 + 1)/(n + 2)) \\
            &+ a_0 \log a_0 + (1 - a_0) \log(1 - a_0).
\end{aligned} \]

(4) By using numerical calculation, the expected values $E[K_{ML}]$ and
$E[K_{Bayes}]$ for $n = 20$, $a_0 = 0.05, 0.10, \ldots, 0.50$, and $\epsilon = 0.0001$ are shown in
Figure 1.7. If $a_0 = 0$, then $E[K_{ML}] = 0$ and $E[K_{Bayes}] > 0$. Discuss the
difference between the maximum likelihood and Bayesian methods from the
viewpoint of the generalization errors.
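The numerical calculation asked for in item (4) needs no simulation, because $n_1$ follows a binomial distribution and both expectations are finite sums. A minimal sketch, assuming $0 < a_0 < 1$ and the estimated probabilities $(n_1+\epsilon)/(n+2\epsilon)$ and $(n_1+1)/(n+2)$; the function names are hypothetical.

```python
import math

def kl_to_estimate(a0, p1):
    """Kullback-Leibler distance from the coin with P(1)=a0 to the coin with P(1)=p1."""
    return a0 * math.log(a0 / p1) + (1.0 - a0) * math.log((1.0 - a0) / (1.0 - p1))

def expected_errors(n, a0, eps):
    """Return (E[K_ML], E[K_Bayes]) by summing over the binomial law of n1."""
    e_ml = 0.0
    e_bayes = 0.0
    for n1 in range(n + 1):
        prob = math.comb(n, n1) * a0**n1 * (1.0 - a0)**(n - n1)
        e_ml += prob * kl_to_estimate(a0, (n1 + eps) / (n + 2.0 * eps))
        e_bayes += prob * kl_to_estimate(a0, (n1 + 1.0) / (n + 2.0))
    return e_ml, e_bayes

if __name__ == "__main__":
    for k in range(1, 11):
        a0 = 0.05 * k
        print(round(a0, 2), expected_errors(20, a0, 1e-4))
```

For small $a_0$ the maximum likelihood error is dominated by samples with $n_1 = 0$, where the estimated probability of a one is almost zero and the logarithm becomes large.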


4. (Simple normal distribution) Let $p(x|a)$ and $\varphi(a)$ be a statistical model
of $x \in \mathbb{R}$ for a given parameter $a \in \mathbb{R}$ and a prior of $a$, respectively,

\[ p(x|a) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2}(x-a)^2\Big), \qquad
   \varphi(a) = \Big(\frac{\lambda}{2\pi}\Big)^{1/2} \exp\Big(-\frac{\lambda}{2}a^2\Big). \]

Figure 1.7: Comparison of Bayes with maximum likelihood in coin toss.
The averages of the generalization errors by the maximum likelihood and
Bayesian methods are compared in a coin toss problem for $n = 20$. The
standard deviations of the maximum likelihood method are larger than those
of Bayes for every $a_0$.

Let $X^n$ be a set of independent random variables which are subject to $p(x|0)$.
Then prove the following equations.

\[ S = C_0 + \frac{1}{2}, \qquad S_n = C_0 + \frac{1}{2n} \sum_{i=1}^{n} X_i^2, \]

where $C_0 = (1/2) \log(2\pi)$. The losses are

\[ \begin{aligned}
G_n &= C_0 + \frac{1}{2}\log\Big(1 + \frac{1}{n_\lambda}\Big) + \frac{n_\lambda}{2(n_\lambda + 1)}
       + \frac{1}{2 n_\lambda (n_\lambda + 1)} \Big(\sum_{i=1}^{n} X_i\Big)^2, \\
C_n &= C_0 - \frac{1}{2}\log\Big(1 - \frac{1}{n_\lambda}\Big)
       + \frac{n_\lambda}{2(n_\lambda - 1)} \frac{1}{n} \sum_{i=1}^{n} (X_i - X^*)^2, \\
T_n &= C_0 + \frac{1}{2}\log\Big(1 + \frac{1}{n_\lambda}\Big)
       + \frac{n_\lambda}{2(n_\lambda + 1)} \frac{1}{n} \sum_{i=1}^{n} (X_i - X^*)^2, \\
W_n &= C_0 + \frac{1}{2}\log\Big(1 + \frac{1}{n_\lambda}\Big) + \frac{1}{2 n_\lambda^2}
       + \Big(\frac{n_\lambda}{2(n_\lambda + 1)} + \frac{1}{n_\lambda}\Big) \frac{1}{n} \sum_{i=1}^{n} (X_i - X^*)^2,
\end{aligned} \]

where $n_\lambda = n + \lambda$ and

\[ X^* = \frac{1}{n_\lambda} \sum_{i=1}^{n} X_i. \]

The free energy is

\[ F_n = n C_0 + \frac{1}{2}\log\Big(\frac{n_\lambda}{\lambda}\Big) - \frac{n_\lambda}{2} (X^*)^2
       + \frac{1}{2} \sum_{i=1}^{n} X_i^2. \]

It follows that $C_n - W_n = O_p(1/n^3)$. Let

\[ g_n = G_n - S, \qquad c_n = C_n - S_n, \qquad w_n = W_n - S_n. \]

Then the histogram of $|g_n - c_n| - |g_n - w_n|$ for $n = 5$ is given by Figure
1.8. In this model, Bayesian observables can be explicitly calculated without
numerical approximation, hence it is easy to compare the cross validation
loss and WAIC as estimators of the generalization loss.

Figure 1.8: Cross validation and WAIC. This figure shows the histogram of
$|g_n - c_n| - |g_n - w_n|$. If $|g_n - c_n| - |g_n - w_n| > 0$ then WAIC is a better
approximator of the generalization loss than the cross validation.

5. (Normal mixture) The Fisher information matrix $I_{ij}(w)$ of a statistical
model $p(x|w)$ is defined by

\[ I_{ij}(w) = \int \partial_i \log p(x|w)\, \partial_j \log p(x|w)\, p(x|w)\, dx, \]

where $\partial_i = \partial/\partial w_i$. Show that the Fisher information matrix of a statistical
model

\[ p(x|a, b) = (1 - a) N(x) + a N(x - b), \tag{1.33} \]

where $N(x)$ is the standard normal distribution, at $(a_0, b_0) = (0.5, 0.3)$ is
numerically approximated by

\[ I(a_0, b_0) \approx \begin{pmatrix} 0.0881 & 0.1467 \\ 0.1467 & 0.2552 \end{pmatrix}, \]

whose minimum and maximum eigenvalues are

\[ 0.0029 \ll 0.3405. \]

Note that the minimum value is far smaller than the maximum value. If a
sample $X^n$ is independently subject to $p(x|a_0, b_0)$ and if regular asymptotic
theory held, then the asymptotic posterior distribution would be approximated
by the normal distribution whose average is $(a_0, b_0)$ and covariance
matrix is $(n I(a_0, b_0))^{-1}$. By comparing Figures 1.4, 1.5, and 1.6 with the
minimum eigenvalue, discuss how large $n$ must be for regular asymptotic
theory to hold. Compare the result with the fact that singular asymptotic
theory holds even when $n$ is smaller than 100.
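The numerical approximation in Problem 5 can be reproduced with a one dimensional quadrature. The sketch below is an illustration under stated choices that are not in the text: the trapezoidal rule on $[-12, 12]$ and the scores $\partial_a \log p = (N(x-b) - N(x))/p$ and $\partial_b \log p = a(x-b)N(x-b)/p$. Function names are hypothetical.

```python
import math

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def fisher_information(a, b, lo=-12.0, hi=12.0, steps=24000):
    """Return (I_aa, I_ab, I_bb) for p(x|a,b) = (1-a) N(x) + a N(x-b)."""
    h = (hi - lo) / steps
    i_aa = i_ab = i_bb = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        n0 = std_normal_pdf(x)
        n1 = std_normal_pdf(x - b)
        p = (1.0 - a) * n0 + a * n1
        score_a = (n1 - n0) / p             # derivative of log p with respect to a
        score_b = a * (x - b) * n1 / p      # derivative of log p with respect to b
        weight = h * (0.5 if k in (0, steps) else 1.0)
        i_aa += weight * score_a * score_a * p
        i_ab += weight * score_a * score_b * p
        i_bb += weight * score_b * score_b * p
    return i_aa, i_ab, i_bb

def eigenvalues_2x2(i_aa, i_ab, i_bb):
    """Eigenvalues (min, max) of the symmetric matrix [[i_aa, i_ab], [i_ab, i_bb]]."""
    mean = 0.5 * (i_aa + i_bb)
    radius = math.sqrt((0.5 * (i_aa - i_bb)) ** 2 + i_ab ** 2)
    return mean - radius, mean + radius

if __name__ == "__main__":
    entries = fisher_information(0.5, 0.3)
    print(entries, eigenvalues_2x2(*entries))
```

The small eigenvalue reflects that the two score functions are nearly proportional when $b$ is small, so the matrix is close to singular.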


6. (Conditional dependent case) A simple linear regression model of $y \in \mathbb{R}$
for given $x \in \mathbb{R}$ and parameter $a \in \mathbb{R}$ is defined by

\[ p(y|x, a) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2}(y - ax)^2\Big). \]

Assume that $p(y|x, a_0)$ is a true conditional probability density of $y \in \mathbb{R}$. Let
$\{x_i; i = 1, 2, \ldots, n\}$ and $\{\tilde{x}_j; j = 1, 2, \ldots, N\}$ be sets of fixed input data used
in estimation (training) and trial (test), respectively. The set of conditionally
independent data is $\{(x_i, Y_i); i = 1, 2, \ldots, n\}$. For simple calculation, we
employ an improper prior $\varphi(a) = 1$ on $\mathbb{R}$. Then the posterior and predictive
distributions are respectively given by

\[ p(a|x^n, Y^n) = \frac{1}{Z} \prod_{i=1}^{n} p(Y_i|x_i, a), \qquad
   p^*(y|x) = E_a[p(y|x, a)], \]

where $Z$ is a constant. The generalization loss is defined by using the test
set,

\[ G_n = -\frac{1}{N} \sum_{j=1}^{N} \int p(y|\tilde{x}_j, a_0) \log p^*(y|\tilde{x}_j)\, dy, \]

which is a random variable because the predictive distribution is a function
of $Y^n$. The leave-one-out cross validation loss and the widely applicable
information criterion are respectively defined by using the training
set,

\[ C_n = \frac{1}{n} \sum_{i=1}^{n} \log E_a[1/p(Y_i|x_i, a)], \]

\[ W_n = -\frac{1}{n} \sum_{i=1}^{n} \log p^*(Y_i|x_i)
         + \frac{1}{n} \sum_{i=1}^{n} V_a[\log p(Y_i|x_i, a)]. \]

The conditional true entropy is

\[ S = -\frac{1}{n} \sum_{i=1}^{n} \int p(y|x_i, a_0) \log p(y|x_i, a_0)\, dy
     = \frac{1}{2}\log(2\pi) + \frac{1}{2}. \]

Then show the following equations.

\[ \begin{aligned}
E[G_n] - S &= \frac{1}{2N} \sum_{j=1}^{N} \log\Big(1 + \frac{\tilde{x}_j^2}{\sum_{i=1}^{n} x_i^2}\Big), \\
E[C_n] - S &= -\frac{1}{2n} \sum_{i=1}^{n} \log(1 - \tau_i), \\
E[W_n] - S &= \frac{1}{2n} \sum_{i=1}^{n} \log(1 + \tau_i)
              + \frac{1}{2n} \sum_{i=1}^{n} \frac{\tau_i^2 (1 - \tau_i)}{1 + \tau_i},
\end{aligned} \]

where

\[ \tau_i = \frac{x_i^2}{\sum_{j=1}^{n} x_j^2}. \]

Prove that, if $n = N$ and $\tilde{x}_i = x_i$ for all $i$,

\[ E[G_n] \leq E[W_n] \leq E[C_n], \]

where equalities hold if and only if $x_i = 0$ for all $i$.
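The ordering in Problem 6 can be explored numerically before it is proved. The sketch below evaluates closed forms that follow from the flat prior posterior $a \sim N(\hat{a}, 1/\sum_i x_i^2)$; writing $\tau_i = x_i^2/\sum_j x_j^2$, the WAIC expression used here, $\frac{1}{2}\log(1+\tau_i) + \tau_i^2(1-\tau_i)/(2(1+\tau_i))$, is a derivation made for this illustration and should be checked as part of the exercise. The function name and the sample inputs are hypothetical.

```python
import math

def expected_errors(xs, xs_test):
    """Return (E[G_n]-S, E[W_n]-S, E[C_n]-S) for fixed training and test inputs."""
    ssq = sum(x * x for x in xs)
    taus = [x * x / ssq for x in xs]          # leverage of each training input
    g = sum(math.log(1.0 + xt * xt / ssq) for xt in xs_test) / (2.0 * len(xs_test))
    c = -sum(math.log(1.0 - t) for t in taus) / (2.0 * len(xs))
    w = sum(0.5 * math.log(1.0 + t) + t * t * (1.0 - t) / (2.0 * (1.0 + t))
            for t in taus) / len(xs)
    return g, w, c

if __name__ == "__main__":
    xs = [0.5, 1.0, 1.5, 2.0]
    print(expected_errors(xs, xs))   # generalization, WAIC, cross validation errors
```

When one input dominates the sum of squares, its $\tau_i$ approaches one and the cross validation term $-\log(1-\tau_i)$ grows without bound, while the WAIC term stays finite.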


Chapter 2


Statistical Models


In this chapter, we introduce several concrete examples of statistical models.
The main purpose of this book is to establish the mathematically universal
theory in Bayesian statistics which holds even for nonregular statistical
models. However, before studying the general formulas, concrete examples
are prepared for understanding them. We introduce

(1) Normal distribution

(2) Multinomial distribution

(3) Linear regression

(4) Neural network

(5) Finite normal mixture

(6) Nonparametric mixture

and then examine the behaviors of the free energy or the minus log marginal
likelihood, and the generalization, training, and cross validation losses, and WAIC.
The statistical models (1), (2), and (3) are regular, whereas (4), (5), and
(6) are nonregular. If a reader has software for numerical calculation, then
it will be easy to reproduce them.


2.1 Normal Distribution 


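Firstly, we study a normal distribution. Let us use a probability density
function of $x \in \mathbb{R}$ for a given parameter $w = (m, s) \in \mathbb{R}^2$ $(s > 0)$,

\[ p(x|m, s) = \sqrt{\frac{s}{2\pi}} \exp\Big(-\frac{s}{2}(x - m)^2\Big). \tag{2.1} \]

The probability distribution represented by this density function is denoted
by $N(m, 1/s)$, where $m$ is the average and $1/s$ is the variance. It can be
rewritten as

\[ p(x|m, s) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{s}{2}x^2 + msx - \frac{1}{2}(m^2 s - \log s)\Big). \]

The conjugate prior $\varphi(m, s|\phi_1, \phi_2, \phi_3)$ of the normal distribution is the same
function of a parameter as the statistical model, by replacing $(x^2, x, 1)$ by a
hyperparameter $\phi = (\phi_1, \phi_2, \phi_3)$,

\[ \varphi(w|\phi) = \varphi(m, s|\phi_1, \phi_2, \phi_3)
   = \frac{1}{Y(\phi)} \exp\Big(-\frac{s}{2}\phi_1 + ms\phi_2 - \frac{1}{2}(m^2 s - \log s)\phi_3\Big), \tag{2.2} \]

where $w = (m, s)$. It follows that

\[ \varphi(w|\phi) = \frac{1}{Y(\phi)}\, s^{\phi_3/2}
   \exp\Big(-\frac{s}{2}\Big(\phi_1 - \frac{\phi_2^2}{\phi_3}\Big)\Big)
   \exp\Big(-\frac{s\phi_3}{2}\Big(m - \frac{\phi_2}{\phi_3}\Big)^2\Big). \]

The constant $Y(\phi)$ is determined by the condition $\int \varphi(w|\phi)\,dw = 1$. By
using integral formulas,

\[ \int_{-\infty}^{\infty} \exp(-ax^2)\, dx = \sqrt{\pi/a}, \tag{2.3} \]

\[ \int_{0}^{\infty} x^{a-1} \exp(-x/b)\, dx = b^{a}\, \Gamma(a), \tag{2.4} \]

where $\Gamma(a)$ is the gamma function, the function $Y(\phi)$ is given by

\[ Y(\phi) = \frac{2\sqrt{\pi}\,(2\phi_3)^{\phi_3/2}}{(\phi_1\phi_3 - \phi_2^2)^{(\phi_3+1)/2}}\,
   \Gamma\Big(\frac{\phi_3 + 1}{2}\Big). \tag{2.5} \]

To ensure $0 < Y(\phi) < \infty$, the hyperparameter should satisfy $\phi_3 > 0$ and
$\phi_1\phi_3 - \phi_2^2 > 0$. Since the conjugate prior has the same form of the parameter
as the statistical model, the simultaneous density of $(w, X^n)$ is
given by

\[ \begin{aligned}
\Omega(w, X^n) &= \varphi(w|\phi) \prod_{i=1}^{n} p(X_i|w) \\
 &= \frac{1}{Y(\phi)(2\pi)^{n/2}}
    \exp\Big(-\frac{s}{2}\hat{\phi}_1 + ms\hat{\phi}_2 - \frac{1}{2}(m^2 s - \log s)\hat{\phi}_3\Big),
\end{aligned} \]

where

\[ \hat{\phi}_1 = \sum_{i=1}^{n} X_i^2 + \phi_1, \tag{2.6} \]

\[ \hat{\phi}_2 = \sum_{i=1}^{n} X_i + \phi_2, \tag{2.7} \]

\[ \hat{\phi}_3 = n + \phi_3. \tag{2.8} \]

The partition function is given by

\[ Z(X^n) = \int \Omega(w, X^n)\, dw = \frac{Y(\hat{\phi})}{Y(\phi)(2\pi)^{n/2}}, \]

and the posterior distribution, which is equal to $\Omega(w, X^n)/Z(X^n)$, is given
by

\[ p(w|X^n) = \varphi(m, s|\hat{\phi}_1, \hat{\phi}_2, \hat{\phi}_3). \]

Hence the minus log marginal likelihood or the free energy $F_n = -\log Z(X^n)$
is

\[ F_n = \frac{n}{2}\log(2\pi) + \log Y(\phi) - \log Y(\hat{\phi}). \tag{2.9} \]

See Figure 2.1. The predictive density is also given by

\[ \begin{aligned}
E_w[p(x|w)] &= \int p(x|w)\, \varphi(w|\hat{\phi})\, dw \\
 &= \frac{1}{\sqrt{2\pi}}\,
    \frac{Y(\hat{\phi}_1 + x^2, \hat{\phi}_2 + x, \hat{\phi}_3 + 1)}{Y(\hat{\phi}_1, \hat{\phi}_2, \hat{\phi}_3)} \\
 &\propto \frac{1}{[(x - \hat{\phi}_2/\hat{\phi}_3)^2 + C_1]^{(\hat{\phi}_3 + 2)/2}},
\end{aligned} \tag{2.10} \]

where $C_1 = (\hat{\phi}_1\hat{\phi}_3 - \hat{\phi}_2^2)(\hat{\phi}_3 + 1)/\hat{\phi}_3^2$. The predictive density is different from
any normal distribution. However, when $n \to \infty$, it converges to a normal
distribution. See Figure 2.2. The training loss is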


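\[ \begin{aligned}
T_n &= -\frac{1}{n} \sum_{i=1}^{n} \log E_w[p(X_i|w)] \\
    &= \frac{1}{2}\log(2\pi) + \log Y(\hat{\phi}_1, \hat{\phi}_2, \hat{\phi}_3)
       - \frac{1}{n} \sum_{i=1}^{n} \log Y(\hat{\phi}_1 + X_i^2, \hat{\phi}_2 + X_i, \hat{\phi}_3 + 1).
\end{aligned} \tag{2.11} \]

Since

\[ E_w[1/p(x|w)] = \sqrt{2\pi}\,
   \frac{Y(\hat{\phi}_1 - x^2, \hat{\phi}_2 - x, \hat{\phi}_3 - 1)}{Y(\hat{\phi}_1, \hat{\phi}_2, \hat{\phi}_3)}, \]

the cross validation loss is equal to

\[ \begin{aligned}
C_n &= \frac{1}{n} \sum_{i=1}^{n} \log E_w[1/p(X_i|w)] \\
    &= \frac{1}{2}\log(2\pi) - \log Y(\hat{\phi}_1, \hat{\phi}_2, \hat{\phi}_3)
       + \frac{1}{n} \sum_{i=1}^{n} \log Y(\hat{\phi}_1 - X_i^2, \hat{\phi}_2 - X_i, \hat{\phi}_3 - 1).
\end{aligned} \tag{2.12} \]

Let $f(\alpha, x)$ be a function

\[ \begin{aligned}
f(\alpha, x) &= \log \int p(x|w)^{\alpha}\, \varphi(w|\hat{\phi})\, dw \\
 &= \log\Big[ \frac{1}{(2\pi)^{\alpha/2}}\,
    \frac{Y(\hat{\phi}_1 + \alpha x^2, \hat{\phi}_2 + \alpha x, \hat{\phi}_3 + \alpha)}{Y(\hat{\phi}_1, \hat{\phi}_2, \hat{\phi}_3)} \Big].
\end{aligned} \]

Since the posterior distribution is equal to $\varphi(w|\hat{\phi})$,

\[ \begin{aligned}
V_w[\log p(x|w)] &= \frac{\partial^2 f}{\partial \alpha^2}(0, x) \\
 &= \frac{\partial^2}{\partial \alpha^2} \Big[ \frac{1}{2}(\hat{\phi}_3 + \alpha) \log(2(\hat{\phi}_3 + \alpha)) \\
 &\qquad - \frac{1}{2}(\hat{\phi}_3 + \alpha + 1) \log\{(\hat{\phi}_1 + \alpha x^2)(\hat{\phi}_3 + \alpha) - (\hat{\phi}_2 + \alpha x)^2\} \\
 &\qquad + \log \Gamma((\hat{\phi}_3 + \alpha + 1)/2) \Big] \Big|_{\alpha = 0} \\
 &= \frac{1}{2\hat{\phi}_3} - A(x) + \frac{\hat{\phi}_3 + 1}{2} A(x)^2
    + \frac{1}{4} \psi'\Big(\frac{\hat{\phi}_3 + 1}{2}\Big),
\end{aligned} \]

where $\psi'(x) = (\log \Gamma(x))''$ is the trigamma function and

\[ A(x) = \frac{\hat{\phi}_3 x^2 - 2\hat{\phi}_2 x + \hat{\phi}_1}{\hat{\phi}_1 \hat{\phi}_3 - (\hat{\phi}_2)^2}. \]

Therefore WAIC is given by

\[ W_n = T_n + \frac{1}{n} \sum_{i=1}^{n} V_w[\log p(X_i|w)]. \]

Figure 2.1: Free energy for $n = 1, 2, \ldots, 1000$. The free energy or the minus
log marginal likelihood of a normal distribution $F_n - nS_n$ (vertical axis) is shown
against the sample size $n = 1, 2, \ldots, 1000$ (horizontal axis). Its asymptotic behavior is given by $\log n$.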


By using the function $Y(\phi)$ in eq.(2.5), we can calculate $F_n$, $T_n$, and $C_n$
using $X^n$, based on eqs.(2.9), (2.11), and (2.12). Let us assume that a true
distribution is $q(x) = p(x|m_0, s_0)$. Then the entropy and empirical entropy
of the true distribution are respectively given by

\[ \begin{aligned}
S &= -\int q(x) \Big\{ \frac{1}{2}\log\Big(\frac{s_0}{2\pi}\Big) - \frac{s_0}{2}(x - m_0)^2 \Big\}\, dx \\
  &= -\frac{1}{2}\log\Big(\frac{s_0}{2\pi}\Big) + \frac{1}{2}, \\
S_n &= -\frac{1}{2}\log\Big(\frac{s_0}{2\pi}\Big) + \frac{s_0}{2n} \sum_{i=1}^{n} (X_i - m_0)^2.
\end{aligned} \]

The training and cross validation errors are respectively given by $T_n - S_n$
and $C_n - S_n$. The generalization loss is given by

\[ \begin{aligned}
G_n &= S + \int q(x) \log \frac{q(x)}{E_w[p(x|w)]}\, dx \\
    &= \frac{1}{2}\log(2\pi) + \log Y(\hat{\phi}_1, \hat{\phi}_2, \hat{\phi}_3)
       - \int q(x) \log Y(\hat{\phi}_1 + x^2, \hat{\phi}_2 + x, \hat{\phi}_3 + 1)\, dx.
\end{aligned} \]

Unfortunately, the integration over $q(x)$ cannot be done analytically.

Figure 2.2: Bayesian observables in normal distribution are shown in the
case $n = 10$. Histograms of (1) training error $T_n - S_n$, (2) cross validation
error $C_n - S_n$, (4) generalization error $G_n - S$, and (5) WAIC error $W_n - S_n$.
Distributions of (3) $(G_n - S, C_n - S_n)$, and (6) $(G_n - S, W_n - S_n)$. Note
that WAIC error has smaller variance than cross validation error.
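The closed forms of this section translate directly into a few lines of code. The sketch below is an illustration, not the author's program: it implements $\log Y(\phi)$ from eq.(2.5), the free energy of eq.(2.9), and obtains the training loss, cross validation loss, and WAIC from $f(\alpha, x) = \log E_w[p(x|w)^\alpha]$, with the posterior variance taken as a central finite difference of $f$ in $\alpha$ (a numerical shortcut chosen here in place of the trigamma expression). The step size and function names are assumptions.

```python
import math

def log_Y(p1, p2, p3):
    """log Y(phi) of eq.(2.5); requires p3 > 0 and p1*p3 - p2^2 > 0."""
    d = p1 * p3 - p2 * p2
    if p3 <= 0.0 or d <= 0.0:
        raise ValueError("hyperparameter outside the admissible region")
    return (math.log(2.0) + 0.5 * math.log(math.pi) + 0.5 * p3 * math.log(2.0 * p3)
            - 0.5 * (p3 + 1.0) * math.log(d) + math.lgamma(0.5 * (p3 + 1.0)))

def posterior_hyper(xs, phi):
    """Posterior hyperparameters of eqs.(2.6)-(2.8)."""
    return (phi[0] + sum(x * x for x in xs), phi[1] + sum(xs), phi[2] + len(xs))

def free_energy(xs, phi):
    """F_n of eq.(2.9)."""
    hat = posterior_hyper(xs, phi)
    return 0.5 * len(xs) * math.log(2.0 * math.pi) + log_Y(*phi) - log_Y(*hat)

def log_moment(alpha, x, hat):
    """f(alpha, x) = log E_w[p(x|w)^alpha] under the posterior with hyperparameter hat."""
    return (-0.5 * alpha * math.log(2.0 * math.pi)
            + log_Y(hat[0] + alpha * x * x, hat[1] + alpha * x, hat[2] + alpha)
            - log_Y(*hat))

def observables(xs, phi, step=1e-3):
    """Return (T_n, C_n, W_n); the posterior variance uses a finite difference in alpha."""
    hat = posterior_hyper(xs, phi)
    n = len(xs)
    t = -sum(log_moment(1.0, x, hat) for x in xs) / n
    c = sum(log_moment(-1.0, x, hat) for x in xs) / n
    v = sum((log_moment(step, x, hat) - 2.0 * log_moment(0.0, x, hat)
             + log_moment(-step, x, hat)) / step ** 2 for x in xs) / n
    return t, c, t + v

if __name__ == "__main__":
    data = [0.2, 1.5, 0.9, 2.1, -0.3, 1.1, 0.7, 1.8]
    print(free_energy(data, (0.5, 0.0, 0.5)), observables(data, (0.5, 0.0, 0.5)))
```

The generalization loss is not included because, as noted above, the integral over $q(x)$ has no closed form; it could be approximated by one dimensional quadrature of `log_moment(1.0, x, hat)` against the true density.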


Example 5. A numerical experiment was conducted by setting $(m_0, s_0) =
(1, 1)$ and $(\phi_1, \phi_2, \phi_3) = (0.5, 0, 0.5)$. In Figure 2.1 the horizontal and
vertical axes show the sample size $n$ and the free energy or the minus log
marginal likelihood minus empirical entropy $F_n - nS_n$ for $n = 1, 2, \ldots, 1000$,
respectively. In the following sections, we will show that $F_n - nS_n =
(d/2) \log n + O_p(1)$ ($d$ is the dimension of the parameter space), which is
consistent with the figure. In Figure 2.2, experimental results of 10000
independent trials for $n = 10$ are shown.

(1) Histogram of the training error, $T_n - S_n$

(2) Cross validation error, $C_n - S_n$

(3) Generalization error and the cross validation error

(4) Histogram of generalization error, $G_n - S$

(5) Histogram of WAIC error, $W_n - S_n$

(6) Generalization error and WAIC

The averages and standard deviations of $G_n - S$, $C_n - S_n$, and $W_n - S_n$ are
numerically approximated by

Average: 0.0901, 0.1017, 0.0860,
Standard deviation: 0.0978, 0.1211, 0.1136.

The variance of the cross validation error is larger than that of the WAIC error.
Moreover,

\[ E|(G_n - S) - (C_n - S_n)| = 0.188, \]
\[ E|(G_n - S) - (W_n - S_n)| = 0.146. \]

Therefore, WAIC is the better estimator of the generalization error than the
cross validation error in this case. The inequality

\[ E|(G_n - S) - (C_n - S_n)| > E|(G_n - S) - (W_n - S_n)| \]

holds for $n = 1, 2, 3, 4, 5$. In the following chapters, we show the higher order
asymptotic equivalence of the cross validation and WAIC as $n \to \infty$. For
finite and smaller $n$, WAIC is a better approximator of the generalization
loss than the cross validation loss in many statistical models.


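2.2 Multinomial Distribution


A multinomial distribution is examined. Let $N$ be a positive integer. An $N$
dimensional variable

\[ x = (x^{(1)}, x^{(2)}, \ldots, x^{(N)}) \]

is said to be competitive if only one element is equal to one, $x^{(j)} = 1$,
and others are equal to zero, $x^{(k)} = 0$ $(k \neq j)$. The set of $N$ dimensional
competitive variables is denoted by

\[ C_N = \{x = (x^{(1)}, x^{(2)}, \ldots, x^{(N)});\ x \text{ is competitive}\}. \]

By the definition, the number of elements of the set $C_N$ is equal to $N$. This
set is used in classification problems where $N$ is the number of categories.
In a coin toss problem, $N = 2$, whereas in a dice throw problem, $N = 6$.
The $N$ dimensional multinomial distribution of one trial $x \in C_N$ for a given
parameter $w$ is defined by

\[ p(x|w) = \prod_{j=1}^{N} (w_j)^{x^{(j)}}, \tag{2.13} \]

where we determine $0^0 = 1$ and a set of all parameters is

\[ W = \{w = (w_1, w_2, \ldots, w_N);\ \sum_{j=1}^{N} w_j = 1,\ w_j \geq 0\}. \]

If a random variable $X = (X^{(1)}, X^{(2)}, \ldots, X^{(N)})$ is subject to $p(x|w)$, then for
an arbitrary $j$, the probability of $X^{(j)} = 1$ is equal to

\[ \mathrm{Prob}(X^{(j)} = 1) = w_j. \]

The Dirichlet distribution on $W$ is often employed as a prior,

\[ \varphi(w|a) = \frac{1}{C(a)} \prod_{j=1}^{N} (w_j)^{a_j - 1}\, \delta\Big(\sum_{j=1}^{N} w_j - 1\Big), \tag{2.14} \]

where a hyperparameter $a$ is an $N$ dimensional vector,

\[ a = (a_1, a_2, \ldots, a_N), \]

which satisfies $a_j > 0$ $(j = 1, 2, \ldots, N)$ and

\[ C(a) = \Big( \prod_{j=1}^{N} \int_0^1 dw_j\, (w_j)^{a_j - 1} \Big)\, \delta\Big(\sum_{j=1}^{N} w_j - 1\Big). \]

Lemma 1. The normalizing term of the Dirichlet distribution is

\[ C(a) = \frac{\prod_{j=1}^{N} \Gamma(a_j)}{\Gamma(\sum_{j=1}^{N} a_j)}. \tag{2.15} \]

Proof. For $\lambda > 0$, let $f(\lambda)$ be the function

\[ f(\lambda) = \Big( \prod_{j=1}^{N} \int_0^{\infty} dw_j\, (w_j)^{a_j - 1} \Big)\, \delta\Big(\sum_{j=1}^{N} w_j - \lambda\Big). \]

Then $f(1) = C(a)$. The Laplace transform of $f(\lambda)$ is

\[ \begin{aligned}
\int_0^{\infty} f(\lambda) e^{-\beta\lambda}\, d\lambda
 &= \prod_{j=1}^{N} \int_0^{\infty} (w_j)^{a_j - 1} \exp(-\beta w_j)\, dw_j \\
 &= \prod_{j=1}^{N} \frac{\Gamma(a_j)}{\beta^{a_j}} \\
 &= \frac{\prod_j \Gamma(a_j)}{\Gamma(\sum_j a_j)} \int_0^{\infty} \lambda^{\sum_j a_j - 1} e^{-\beta\lambda}\, d\lambda.
\end{aligned} \]

Therefore, by using the inverse Laplace transform of this equation, $f(\lambda)$ is
obtained,

\[ f(\lambda) = \frac{\prod_j \Gamma(a_j)}{\Gamma(\sum_j a_j)}\, \lambda^{\sum_j a_j - 1}. \]

Then $f(1)$ gives eq.(2.15). $\Box$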


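Let us make the posterior distribution for a multinomial distribution and
Dirichlet prior. Let $X^n$ be a set of $n$ random variables on $C_N$,

\[ X^n = \{X_i = (X_i^{(1)}, \ldots, X_i^{(N)});\ i = 1, 2, \ldots, n\}. \]

A random variable $n_j$ is defined by

\[ n_j = \sum_{i=1}^{n} X_i^{(j)}, \]

which is the number of sample points classified into the $j$th category. Then
$\sum_j n_j = n$ and the posterior distribution is

\[ \begin{aligned}
p(w|X^n) &= \frac{1}{Z(X^n)} \frac{1}{C(a)} \prod_{i=1}^{n} \prod_{j=1}^{N} (w_j)^{X_i^{(j)}}
            \prod_{j=1}^{N} (w_j)^{a_j - 1}\, \delta\Big(\sum_j w_j - 1\Big) \\
         &= \frac{1}{Z(X^n)} \frac{1}{C(a)} \prod_{j=1}^{N} (w_j)^{n_j + a_j - 1}\, \delta\Big(\sum_j w_j - 1\Big).
\end{aligned} \]

By using a notation $\mathbf{n} = (n_1, n_2, \ldots, n_N)$, eq.(2.15), and $\int p(w|X^n)\,dw = 1$,
the marginal likelihood or the partition function satisfies

\[ Z(X^n)\, C(a) = C(\mathbf{n} + a). \]

Therefore

\[ Z(X^n) = \frac{\prod_j \Gamma(n_j + a_j)}{\Gamma(n + \sum_j a_j)} \cdot
            \frac{\Gamma(\sum_j a_j)}{\prod_j \Gamma(a_j)}. \]

Thus the minus log marginal likelihood or the free energy is given by

\[ F_n = \log \Gamma\Big(n + \sum_{j=1}^{N} a_j\Big) - \sum_{j=1}^{N} \log \Gamma(n_j + a_j)
       - \log \Gamma\Big(\sum_{j=1}^{N} a_j\Big) + \sum_{j=1}^{N} \log \Gamma(a_j). \tag{2.16} \]

The predictive distribution $p(x|X^n)$ of $x = (x^{(1)}, x^{(2)}, \ldots, x^{(N)})$ is given by

\[ \begin{aligned}
p(x|X^n) &= E_w[p(x|w)] = \int p(x|w)\, p(w|X^n)\, dw \\
         &= \frac{Z(X^{n+1})}{Z(X^n)} \\
         &= \frac{\prod_j \Gamma(x^{(j)} + n_j + a_j)}{\Gamma(n + 1 + \sum_j a_j)} \cdot
            \frac{\Gamma(n + \sum_j a_j)}{\prod_j \Gamma(n_j + a_j)},
\end{aligned} \]

where we used a notation $X^{n+1} = (X^n, x)$. By using $\Gamma(x + 1) = x\Gamma(x)$ for
an arbitrary $x > 0$, it follows that

\[ p(x|X^n) = \frac{\prod_{j=1}^{N} (n_j + a_j)^{x^{(j)}}}{n + \sum_{j=1}^{N} a_j}. \]

Thus the training loss is

\[ \begin{aligned}
T_n &= -\frac{1}{n} \sum_{i=1}^{n} \log p(X_i|X^n) \\
    &= \log\Big(n + \sum_{j=1}^{N} a_j\Big) - \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N} X_i^{(j)} \log(n_j + a_j) \\
    &= \log\Big(n + \sum_{j=1}^{N} a_j\Big) - \sum_{j=1}^{N} \frac{n_j}{n} \log(n_j + a_j).
\end{aligned} \]

A part of the cross validation loss is

\[ \begin{aligned}
E_w[1/p(X_i|w)] &= \frac{Z(X^n \setminus X_i)}{Z(X^n)} \\
 &= \frac{\prod_{j=1}^{N} \Gamma(-X_i^{(j)} + n_j + a_j)}{\Gamma(n - 1 + \sum_{j=1}^{N} a_j)} \cdot
    \frac{\Gamma(n + \sum_{j=1}^{N} a_j)}{\prod_{j=1}^{N} \Gamma(n_j + a_j)} \\
 &= \frac{n - 1 + \sum_{j=1}^{N} a_j}{\prod_{j=1}^{N} (n_j + a_j - 1)^{X_i^{(j)}}}.
\end{aligned} \]

Hence the cross validation loss is given by

\[ \begin{aligned}
C_n &= \frac{1}{n} \sum_{i=1}^{n} \log E_w[1/p(X_i|w)] \\
    &= \log\Big(n - 1 + \sum_{j=1}^{N} a_j\Big) - \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N} X_i^{(j)} \log(n_j + a_j - 1) \\
    &= \log\Big(n - 1 + \sum_{j=1}^{N} a_j\Big) - \sum_{j=1}^{N} \frac{n_j}{n} \log(n_j + a_j - 1).
\end{aligned} \tag{2.17} \]

The above equations for $F_n$, $T_n$, $C_n$, and $W_n$ are applicable to any
sample $X^n$. However, if we adopt the hyperparameter $a = (a_1, a_2, \ldots, a_N)$ as
$0 < a_j < 1$, and if at least one of $n_j$ $(j = 1, 2, \ldots, N)$ is equal to zero, then
the cross validation loss diverges. In practical applications, we should
keep in mind that, if $N$ is large or $n$ is small, the data contains $n_j = 0$. Then by
using

\[ \int p(x|w)^{\alpha}\, p(w|X^n)\, dw
   = \frac{\prod_j \Gamma(\alpha x^{(j)} + n_j + a_j)}{\Gamma(n + \alpha + \sum_j a_j)} \cdot
     \frac{\Gamma(n + \sum_j a_j)}{\prod_j \Gamma(n_j + a_j)}, \tag{2.18} \]

where $\sum_j$ and $\prod_j$ represent the sum and the product for $j = 1, 2, 3, \ldots, N$
respectively, we can derive

\[ \begin{aligned}
V_w[\log p(x|w)] &= \frac{\partial^2}{\partial \alpha^2} \Big[ \log \int p(x|w)^{\alpha}\, p(w|X^n)\, dw \Big]_{\alpha = 0} \\
 &= \sum_{j} x^{(j)}\, \psi'(n_j + a_j) - \psi'\Big(n + \sum_{j} a_j\Big),
\end{aligned} \]

where $\psi'(x) = (\log \Gamma(x))''$ is the trigamma function. Therefore WAIC is
given by

\[ W_n = T_n + V_n, \]

where

\[ V_n = \sum_{j=1}^{N} \frac{n_j}{n}\, \psi'(n_j + a_j) - \psi'\Big(n + \sum_{j=1}^{N} a_j\Big). \]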

Assume that $p(x|w_0)$ is the true distribution, where

\[ w_0 = (w_{01}, w_{02}, \ldots, w_{0N}). \]

The generalization loss is explicitly given by

\[ \begin{aligned}
G_n &= -\sum_{x \in C_N} p(x|w_0) \log p(x|X^n) \\
    &= \sum_{x \in C_N} \Big\{ \log\Big(n + \sum_{j=1}^{N} a_j\Big) - \sum_{j=1}^{N} x^{(j)} \log(n_j + a_j) \Big\}
       \prod_{j=1}^{N} (w_{0j})^{x^{(j)}} \\
    &= \log\Big(n + \sum_{j=1}^{N} a_j\Big) - \sum_{j=1}^{N} w_{0j} \log(n_j + a_j).
\end{aligned} \tag{2.19} \]

The entropy and the empirical entropy of the true distribution are respectively
given by

\[ S = -\sum_{x \in C_N} p(x|w_0) \log p(x|w_0) = -\sum_{j=1}^{N} w_{0j} \log w_{0j}, \tag{2.20} \]

\[ S_n = -\frac{1}{n} \sum_{i=1}^{n} \log p(X_i|w_0) = -\sum_{j=1}^{N} \frac{n_j}{n} \log w_{0j}. \tag{2.21} \]

Hence the generalization error can be calculated.
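The multinomial formulas depend on the sample only through the counts $n_j$, so they are short to implement. The sketch below is an illustration under the assumption that the closed forms read as follows: free energy (2.16) in terms of log gamma functions, predictive probability $(n_j + a_j)/(n + \sum_j a_j)$, training loss, cross validation loss (2.17), and generalization loss (2.19). Function names are hypothetical.

```python
import math

def free_energy(counts, a):
    """F_n of eq.(2.16) from the category counts n_j and the hyperparameter a."""
    n, a_sum = sum(counts), sum(a)
    return (math.lgamma(n + a_sum) - sum(math.lgamma(nj + aj) for nj, aj in zip(counts, a))
            - math.lgamma(a_sum) + sum(math.lgamma(aj) for aj in a))

def predictive(counts, a):
    """Predictive probability of each category, (n_j + a_j) / (n + sum_j a_j)."""
    total = sum(counts) + sum(a)
    return [(nj + aj) / total for nj, aj in zip(counts, a)]

def training_loss(counts, a):
    n = sum(counts)
    return math.log(n + sum(a)) - sum(nj / n * math.log(nj + aj) for nj, aj in zip(counts, a))

def cv_loss(counts, a):
    """C_n of eq.(2.17); math.log raises ValueError when n_j + a_j - 1 <= 0."""
    n = sum(counts)
    return (math.log(n - 1 + sum(a))
            - sum(nj / n * math.log(nj + aj - 1.0) for nj, aj in zip(counts, a)))

def generalization_loss(counts, a, w0):
    """G_n of eq.(2.19) for a true parameter w0."""
    n = sum(counts)
    return math.log(n + sum(a)) - sum(w * math.log(nj + aj) for w, nj, aj in zip(w0, counts, a))

if __name__ == "__main__":
    counts, a = [2, 1, 2, 2, 3], [1.1] * 5
    w0 = [0.1, 0.15, 0.2, 0.25, 0.3]
    print(free_energy(counts, a), training_loss(counts, a), cv_loss(counts, a),
          generalization_loss(counts, a, w0))
```

A useful consistency check is that the free energy equals the accumulated log loss of one-step-ahead predictions, which is the sample-wise form of Theorem 2.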


Example 6. An experiment was conducted for the case $N = 5$, $w_0 =
(0.1, 0.15, 0.2, 0.25, 0.3)$, and $a = (1.1, 1.1, 1.1, 1.1, 1.1)$. Figure 2.3 shows
experimental results of $F_n - nS_n$ for $n = 1, 2, 3, \ldots, 1000$. Figure 2.4 shows
experimental results for $n = 20$.

Figure 2.3: $F_n - nS_n$ for $n = 1, 2, \ldots, 1000$. The free energy or the minus
log marginal likelihood of the multinomial distribution is shown (vertical axis:
$F_n - nS_n$; horizontal axis: sample size $n$). The
asymptotic behavior of the random variable $F_n - nS_n$ is in proportion to
$\log n$.

(1) Histogram of the training error, $T_n - S_n$

(2) Cross validation error, $C_n - S_n$

(3) Generalization error and the cross validation error

(4) Histogram of the generalization error, $G_n - S$

(5) Histogram of WAIC error, $W_n - S_n$

(6) Generalization error and WAIC

The averages and standard deviations of $G_n - S$, $C_n - S_n$, and $W_n - S_n$ are
numerically approximated by 10000 independent trials.

Average: 0.0684, 0.0700, 0.0672,
Standard deviation: 0.0502, 0.0749, 0.0749.

The variance of the cross validation error is almost the same as that of the WAIC error.
Moreover,

\[ E|(G_n - S) - (C_n - S_n)| = 0.0948, \]
\[ E|(G_n - S) - (W_n - S_n)| = 0.0943. \]

Therefore, in this case the cross validation error is almost the same as the WAIC
error.

Figure 2.4: Bayesian observables of multinomial distribution for $n = 20$.
Histograms of (1) training error $T_n - S_n$, (2) cross validation error $C_n - S_n$,
(4) generalization error $G_n - S$, and (5) WAIC error $W_n - S_n$. Distributions
of (3) $(G_n - S, C_n - S_n)$, and (6) $(G_n - S, W_n - S_n)$.


2.3 Linear Regression 


In the foregoing two models, the posterior averaging could be done analyti- 
cally. In this section, we study a linear regression model, where the posterior 
averaging is numerically approximated by random sampling. Let us study 
a statistical model and a prior which are defined for x,y,a € R', s > 0 by 


rlule.as) = \/ SX exr(-$(y-a2)?), 


el@slr) = yyqay 8 exe(-3(? +0), 


where r is a hyperparameter. This prior is proper if and only if r > —1/2. 
The normalizing constant Y (@, yw, p) is given by 


Y (@, 11, p) 27 aa [ ds s+#/2 exp(—= 5 (ua® + p)). 


2.3. LINEAR REGRESSION A9 
Lemma 2. The normalizing constant Y (0, 4, p) is equal to 
Y(é,u,0) = (2n/u)/?(2/p) te VPT(r + (€ + 1)/2). 
Proof. By the definition, 
i, da I ds s"+#/2 exp(—5 (ua + p)) 
I ds st? (In | sps)'/? exp(—) 


= = S 
n/n)? f ds grtt/2 "7? exp(——) 


YG ie) 


(2m /p)'/2(2/p)r+8/241/2 1 ds gh te/2-1/2 exp(—s) 
0 


(2m /p)'/?(2/p) °F" OP T(E + ar + 1)/2), 


which completes the lemma. O 


Let $(X^n, Y^n) = \{(X_i, Y_i);\ i = 1, 2, \ldots, n\}$ be a sample which is
independently taken from a true probability density q(x, y). The simultaneous
distribution $Q(a, s, Y^n | X^n)$ for given $X^n$ using the statistical model
is given by

$$
\begin{aligned}
Q(a,s,Y^n|X^n)
&= \varphi(a,s|r) \prod_{i=1}^{n} p(Y_i|X_i,a,s) \\
&= \frac{s^{n/2+r}}{Y(0,1,1)\,(2\pi)^{n/2}} \exp\Big(-\frac{s}{2}\Big(\sum_{i=1}^{n}(Y_i - aX_i)^2 + a^2 + 1\Big)\Big) \\
&= \frac{s^{n/2+r}}{Y(0,1,1)\,(2\pi)^{n/2}} \exp\Big(-\frac{s}{2}\big(A(a-B)^2 + C\big)\Big),
\end{aligned}
$$

where A, B, and C are constant functions of the parameter,

$$A = \sum_{i=1}^{n} X_i^2 + 1,$$

$$B = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2 + 1},$$

$$C = -\frac{\big(\sum_{i=1}^{n} X_i Y_i\big)^2}{\sum_{i=1}^{n} X_i^2 + 1} + \sum_{i=1}^{n} Y_i^2 + 1.$$

Note that, even if a prior is improper, the posterior distribution is
well-defined if (n - 1)/2 + r > -1. As a result, the posterior density is

$$p(a,s|X^n,Y^n) = \frac{1}{Z(X^n,Y^n)}\, Q(a,s,Y^n|X^n),$$

where the partition function or the marginal likelihood is

$$Z(X^n,Y^n) = \int Q(a,s,Y^n|X^n)\, da\, ds = \frac{Y(n,A,C)}{(2\pi)^{n/2}\, Y(0,1,1)}.$$

The free energy or the minus log marginal likelihood is

$$F_n = -\log Z(X^n,Y^n) = \frac{n}{2}\log(2\pi) + \log Y(0,1,1) - \log Y(n,A,C).$$


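Then the posterior density can be decomposed as

$$p(a,s|X^n,Y^n) = p(a|s,X^n,Y^n)\, p(s|X^n), \qquad (2.22)$$

$$p(a|s,X^n,Y^n) \propto \exp\Big(-\frac{sA}{2}(a - B)^2\Big), \qquad (2.23)$$

$$p(s|X^n) \propto s^{(n-1)/2+r} \exp\Big(-\frac{C}{2}s\Big). \qquad (2.24)$$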


Thus a set of parameters $\{(a_t, s_t);\ t = 1, 2, \ldots, T\}$ which is
independently subject to the posterior density can be obtained by the
following procedure. Firstly, each variable in $\{s_t\}$ is independently
taken from $p(s|X^n)$; then each variable in $\{a_t\}$ is independently taken
from $p(a|s_t, X^n, Y^n)$. Then the posterior average of a function f(a, s)
can be numerically approximated by

$$E_{(a,s)}[f(a,s)] \approx \frac{1}{T} \sum_{t=1}^{T} f(a_t, s_t). \qquad (2.25)$$
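
The two-step sampling procedure above can be sketched in Python as follows. This is a minimal illustration rather than the author's implementation: the function names and the use of NumPy's random generator are assumptions of this sketch. It relies on two facts that follow from the posterior decomposition: the density of s is a gamma density with shape (n+1)/2 + r and rate C/2 (it depends on the sample through C), and, given s, the density of a is normal with mean B and variance 1/(sA).

```python
import numpy as np

def posterior_constants(X, Y):
    """Constants A, B, C of the simultaneous distribution Q."""
    sxx = np.sum(X * X) + 1.0
    sxy = np.sum(X * Y)
    A = sxx
    B = sxy / sxx
    C = -sxy ** 2 / sxx + np.sum(Y * Y) + 1.0
    return A, B, C

def sample_posterior(X, Y, r, T, rng):
    """Draw T independent pairs (a_t, s_t) from the posterior."""
    n = len(X)
    A, B, C = posterior_constants(X, Y)
    shape = (n + 1) / 2.0 + r              # density of s is s^{shape-1} exp(-C s / 2)
    s = rng.gamma(shape, 2.0 / C, size=T)  # NumPy takes scale = 1 / rate
    a = rng.normal(B, 1.0 / np.sqrt(s * A))
    return a, s
```

A posterior average such as eq.(2.25) is then, for example, `np.mean(f(a, s))` for a vectorized function `f`. The shape parameter is positive exactly when (n - 1)/2 + r > -1.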


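The generalization, training, and cross validation losses and WAIC are
approximated respectively by

$$G_n \approx -\int q(x,y) \log\Big(\frac{1}{T}\sum_{t=1}^{T} p(y|x,a_t,s_t)\Big)\, dx\, dy,$$

$$T_n \approx -\frac{1}{n}\sum_{i=1}^{n} \log\Big(\frac{1}{T}\sum_{t=1}^{T} p(Y_i|X_i,a_t,s_t)\Big),$$

$$C_n \approx \frac{1}{n}\sum_{i=1}^{n} \log\Big(\frac{1}{T}\sum_{t=1}^{T} \frac{1}{p(Y_i|X_i,a_t,s_t)}\Big),$$

$$W_n \approx T_n + \frac{V_n}{n},$$

where $V_n$ is the sum of the posterior variance,

$$V_n = \sum_{i=1}^{n} \Big\{ \frac{1}{T}\sum_{t=1}^{T} \big(\log p(Y_i|X_i,a_t,s_t)\big)^2 - \Big(\frac{1}{T}\sum_{t=1}^{T} \log p(Y_i|X_i,a_t,s_t)\Big)^2 \Big\}.$$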
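
Given the log likelihood values at the posterior samples, the training loss, the cross validation loss, and WAIC can be computed as in the following sketch. It is a hedged illustration: the array layout `loglik[t, i]`, the helper `log_mean_exp`, and the function names are assumptions of this sketch. WAIC is taken here as the training loss plus the sum of the posterior variances of the log likelihood divided by n, and the averages over t are computed in the log domain for numerical stability.

```python
import numpy as np

def log_mean_exp(z, axis):
    """log of the mean of exp(z) along an axis, computed stably."""
    m = np.max(z, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(z - m), axis=axis))

def bayes_observables(loglik):
    """loglik[t, i] = log p(Y_i | X_i, a_t, s_t), shape (T, n).

    Returns (T_n, C_n, W_n, V_n): training loss, importance sampling
    cross validation loss, WAIC, and the summed posterior variance.
    """
    T, n = loglik.shape
    T_n = -np.mean(log_mean_exp(loglik, axis=0))
    C_n = np.mean(log_mean_exp(-loglik, axis=0))
    V_n = np.sum(np.mean(loglik ** 2, axis=0) - np.mean(loglik, axis=0) ** 2)
    W_n = T_n + V_n / n
    return T_n, C_n, W_n, V_n
```

By Jensen's inequality the cross validation loss computed in this way is never smaller than the training loss, which gives a simple consistency check on an implementation.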


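Note that replacing $V_n$ by $V_n T/(T-1)$ gives the unbiased estimate of the
posterior variance. If a true distribution q(x, y) = q(x)q(y|x) is equal to

$$q(x) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{x^2}{2}\Big),$$

$$q(y|x) = p(y|x, a_0, s_0),$$

then the entropy and the empirical entropy are respectively given by

$$S = \frac{1}{2}\log\Big(\frac{2\pi}{s_0}\Big) + \frac{1}{2},$$

$$S_n = \frac{1}{2}\log\Big(\frac{2\pi}{s_0}\Big) + \frac{s_0}{2n}\sum_{i=1}^{n}(Y_i - a_0 X_i)^2.$$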


Remark 15. Let us define the Akaike information criterion in Bayes ($AIC_b$)
and the deviance information criterion (DIC) by the following equations.

$$AIC_b = T_n + \frac{d}{n},$$

$$DIC = -\frac{2}{n}\sum_{i=1}^{n} \frac{1}{T}\sum_{t=1}^{T} \log p(Y_i|X_i,a_t,s_t) + \frac{1}{n}\sum_{i=1}^{n} \log p(Y_i|X_i,\bar{a},\bar{s}),$$

where d is the number or dimension of parameters and $(\bar{a}, \bar{s})$ is
the empirical mean of the posterior parameters $\{(a_t, s_t)\}$. The criteria
$AIC_b$ and DIC can be understood as estimators of the generalization loss.
Note that the conventional AIC is defined by using the maximum likelihood
estimator, whereas $AIC_b$ is defined by using the Bayesian predictive
density. Hence $AIC \neq AIC_b$ in general. If a true distribution is
realizable by a statistical model and if the posterior distribution can be
approximated by some normal distribution, then AIC, $AIC_b$, and DIC are
asymptotically unbiased estimators of the generalization loss. Otherwise,
this statement does not hold.
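
For the linear regression model of this section, the two criteria of Remark 15 can be computed from posterior samples as in the sketch below. It is an illustration only: the function names and the default d = 2 (the parameters a and s) are assumptions of this sketch. Here $AIC_b$ is the training loss plus d/n, and DIC is minus twice the posterior average of the mean log likelihood plus the mean log likelihood at the empirical posterior mean of (a, s).

```python
import numpy as np

def log_p(Y, X, a, s):
    """log p(y|x,a,s) for the model y ~ N(a x, 1/s)."""
    return 0.5 * np.log(s / (2.0 * np.pi)) - 0.5 * s * (Y - a * X) ** 2

def aic_b_and_dic(X, Y, a, s, d=2):
    """a, s: posterior samples of shape (T,); X, Y: data of shape (n,)."""
    n = len(X)
    loglik = log_p(Y[None, :], X[None, :], a[:, None], s[:, None])  # (T, n)
    m = loglik.max(axis=0)
    # training loss: minus the mean log of the averaged likelihood
    T_n = -np.mean(m + np.log(np.mean(np.exp(loglik - m), axis=0)))
    aic_b = T_n + d / n
    dic = -2.0 * np.mean(loglik) + np.mean(log_p(Y, X, a.mean(), s.mean()))
    return aic_b, dic
```

DIC depends on the posterior mean of the parameter, which is one reason why it behaves poorly when the posterior is far from normal.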
Figure 2.5: Standard deviations caused by posterior sampling. In a simple
linear regression problem, standard deviations of the generalization, cross
validation, WAIC, $AIC_b$, and DIC errors are compared as a function of the
number of the posterior parameters.


Example 7. For the case r = 1, $(a_0, s_0) = (0.3, 0.5)$, and n = 30, an
experiment was conducted. The true distribution of X was set as the standard
normal distribution and that of Y as $a_0 X + N(0, 1/s_0)$. Firstly, let
us examine the fluctuation caused by posterior sampling. The standard
deviations of these observables for a fixed sample $(X^n, Y^n)$ in the cases
T = 100, 200, ..., 1000 are shown in Figure 2.5. The standard deviation of
the cross validation error caused by posterior sampling is larger than those
of the other errors. The standard deviations were ordered as

$$\sigma(AIC_b) < \sigma(DIC) < \sigma(WAIC) < \sigma(\text{cross validation}).$$

Note that this order is not yet mathematically proved, and it may depend
on the conditions on the true distribution and the statistical model. In
fact, if the posterior distribution cannot be approximated by any normal
distribution, the variance of DIC becomes far larger than the others.
Secondly, we study the averages by comparing them as functions of the
hyperparameter, r = -3, -1, 1, 3, 5. For $r \le -1/2$, the prior is improper.
However, the posterior is well defined and the cross validation loss and WAIC
can estimate the generalization loss. In Figure 2.6 the averages of the
generalization error, the cross validation error, the WAIC error, the $AIC_b$
error, and the DIC error are compared. In this experiment their averages were
numerically calculated using 1000 independent samples of $(X^n, Y^n)$, n = 30.
As estimators of the generalization error, neither $AIC_b$ nor DIC gave an
appropriate function of the hyperparameter, whereas both the cross validation
and WAIC did. In the following chapters, we show that the averages of the
cross validation loss and WAIC have the same higher order asymptotic behavior
as the generalization loss, so that they can be employed in evaluation of the
average generalization loss as a function of a hyperparameter.

Figure 2.6: Information criteria as a function of a hyperparameter. In a
simple linear regression problem, the averages of the generalization, cross
validation, WAIC, $AIC_b$, and DIC errors are compared as a function of the
hyperparameter r. These are averages over 1000 independent samples with
n = 30. Both the cross validation error and the WAIC error exhibited the same
behavior as the generalization error, whereas neither $AIC_b$ nor DIC did.


2.4 Neural Network 


In many statistical models, the posterior average cannot be calculated an- 
alytically, hence the Markov chain Monte Carlo method is necessary (see 
Chapter 7). Moreover, in statistical models which have hierarchical struc- 
tures or hidden variables, the posterior distribution cannot be approximated 
by any normal distribution. An example of such a statistical model is an
artificial neural network. It is a function $f(x,w) = \{f_j(x,w)\}$ from
$x \in \mathbb{R}^L$ to $\mathbb{R}^N$ which is defined by

$$f_j(x,w) = \sigma\Big(\sum_{k=1}^{K} u_{jk}\, \sigma\Big(\sum_{\ell=1}^{L} w_{k\ell}\, x_\ell + \theta_k\Big) + \phi_j\Big), \qquad (2.26)$$

where $\sigma(t)$ is a sigmoidal function of t,

$$\sigma(t) = \frac{1}{1 + \exp(-t)},$$

and a parameter w is

$$w = \{u_{jk},\ w_{k\ell},\ \phi_j,\ \theta_k\},$$

where $u_{jk}$ and $w_{k\ell}$ are called weight parameters and $\phi_j$ and
$\theta_k$ bias parameters. This function is called a three-layer neural
network. Recently statistical models which have deeper layers are being
applied to many practical problems. The statistical model for a regression
problem using a three-layer neural network is a conditional probability
density

$$p(y|x,w) = \frac{1}{(2\pi)^{N/2}} \exp\Big(-\frac{1}{2}\|y - f(x,w)\|^2\Big), \qquad (2.27)$$

where N is the dimension of y.


A neural network can also be used for a classification problem. For the case
$y \in \{0, 1\}$ its statistical model is represented by

$$p(y|x,w) = f(x,w)^{y}\,(1 - f(x,w))^{1-y}.$$

In this model, f(x, w) is used for estimating the conditional probability of
y for a given x. By setting a prior on the parameter w, the posterior and
predictive distributions are numerically approximated. Since the posterior
distribution cannot be approximated by any normal distribution, neither AIC
nor BIC can be used for evaluation of a model and a prior. However, WAIC and
ISCV can be used.
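
The three-layer network and the classification model can be written compactly with matrix products, as in the following sketch. The array shapes, the function names, and the symbol N for the number of outputs are assumptions of this illustration; the hidden and output units both use the sigmoidal function σ(t) = 1/(1 + exp(-t)).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def three_layer(x, u, w, theta, phi):
    """Three-layer network with sigmoidal hidden and output units.

    x: (L,) input, w: (K, L) and u: (N, K) weights,
    theta: (K,) and phi: (N,) biases. Returns f(x, w) of shape (N,).
    """
    hidden = sigmoid(w @ x + theta)
    return sigmoid(u @ hidden + phi)

def log_p_classification(y, x, u, w, theta, phi):
    """log p(y|x,w) = y log f + (1 - y) log(1 - f) for y in {0, 1}, N = 1."""
    f = three_layer(x, u, w, theta, phi)[0]
    return y * np.log(f) + (1 - y) * np.log(1.0 - f)
```

A network without bias parameters, as in the experiment that follows, corresponds to passing zero arrays for `theta` and `phi`.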


Figure 2.7: Classification problem. Model comparison of neural networks
is studied. The true number of the hidden units was three. Both ISCV
and WAIC errors correctly estimated the generalization error. $AIC_b$
overestimated the generalization error, and DIC did not work appropriately.

Example 8. An experiment was conducted on a neural network which had
no bias parameters. An input sample $\{x_i \in \mathbb{R}^2;\ i = 1, 2, \ldots, n\}$
(n = 200) was taken from the uniform distribution on $[-2, 2]^2$. The true
conditional distribution was $p(y|x, w_0)$, where $p(y|x, w_0)$ was a neural
network with three hidden units, H = 3. A prior was set by the normal
distribution $N(0, 10^2)$ on each $u_{jk}$ and $w_{k\ell}$. The posterior
distribution was approximated by a Metropolis method (see Section 7.1). We
prepared five candidate neural networks which have H = 1, 2, 3, 4, 5 hidden
units. Figure 2.7 shows the results of 20 trials of:

1) Upper, left: $G - S$ for H = 1, 2, 3, 4, 5

2) Upper, right: $AIC_b - S_n$ for H = 1, 2, 3, 4, 5

3) Lower, left: $ISCV - S_n$ for H = 1, 2, 3, 4, 5

4) Lower, right: $WAIC - S_n$ for H = 1, 2, 3, 4, 5

where G, ISCV, $AIC_b$, and WAIC are calculated by using the Markov chain
Monte Carlo method. In this problem the values of DIC were quite different
from the others, which shows that DIC is not appropriate for evaluation of
hierarchical statistical models. Note that in a neural network the posterior
average of the parameter has no meaning. In Bayesian estimation, the
generalization error of a neural network did not increase much even if the
statistical model was larger than the true model. This is a general property
of Bayesian inference in hierarchical models, whose mathematical reason will
be clarified in Chapter 5. Even in the case H = 3, in which the statistical
model is just equal to the true distribution, the generalization error was
sometimes not minimized. Such a phenomenon was caused by the local minima of
the Metropolis method in neural networks. Both ISCV and WAIC correctly
estimated the generalization errors, whereas $AIC_b$ overestimated them. Note
that, in Bayesian estimation, the increase of the generalization error is
very small even if a statistical model is redundant for a true model, so that
the increases of both ISCV and WAIC are also small. In selection of
hierarchical models, a statistician should understand this point.


2.5 Finite Normal Mixture 


Another example of nonregular statistical models is a normal mixture. Let
N(x|b) be a normal distribution of $x \in \mathbb{R}^M$ whose average is
$b \in \mathbb{R}^M$,

$$N(x|b) = \frac{1}{(2\pi)^{M/2}} \exp\Big(-\frac{1}{2}\|x - b\|^2\Big).$$

A normal mixture on $\mathbb{R}^M$ is defined by

$$p(x|a,b) = \sum_{k=1}^{K} a_k\, N(x|b_k), \qquad (2.28)$$

where $a = (a_1, a_2, \ldots, a_K)$ and $b = (b_1, b_2, \ldots, b_K)$ are
parameters of a normal mixture, which satisfy $\sum_{k=1}^{K} a_k = 1$ and
$a_k \ge 0$, and $b_k \in \mathbb{R}^M$. The finite positive integer K is
called the number of components. For the prior, we adopt

$$\varphi(a) = \frac{1}{z_1} \prod_{k=1}^{K} (a_k)^{\alpha_k - 1},$$

$$\varphi(b) = \frac{1}{z_2} \prod_{k=1}^{K} \exp\Big(-\frac{1}{2\sigma^2}\|b_k\|^2\Big),$$

where $\varphi(a)$ and $\varphi(b)$ are the Dirichlet distribution with index
$\{\alpha_k\}$ and the normal distribution, respectively. Here
$\alpha_k, \sigma^2 > 0$ are hyperparameters and $z_1, z_2 > 0$ are constants.

A variable $y = (y_1, y_2, \ldots, y_K)$ is competitive if $y_k = 1$ for some
k and if $y_\ell = 0$ for the other $\ell \neq k$. A statistical model on
(x, y) is defined by

$$p(x,y|a,b) = \prod_{k=1}^{K} \big(a_k\, N(x|b_k)\big)^{y_k}. \qquad (2.29)$$

Then

$$p(x|a,b) = \sum_{y} p(x,y|a,b),$$

where the summation is taken over all competitive y. In a normal mixture,
by understanding $Y^n = \{Y_i\}$ as hidden or latent variables of a
statistical model p(x, y|a, b), the posterior distribution of $(a, b, Y^n)$
is given by

$$p(a,b,Y^n|X^n) \propto \varphi(a)\,\varphi(b) \prod_{i=1}^{n} p(X_i,Y_i|a,b).$$

By using the Gibbs sampler, which is explained in Chapter 7, we obtain the
posterior samples $\{a_t, b_t, Y_t^n\}$. Hence by using $\{a_t, b_t\}$, the
posterior and predictive distributions are numerically approximated.
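
To make the role of the competitive variable concrete, the sketch below evaluates the mixture density and draws the latent variables from their conditional distribution given (a, b), which is proportional to $a_k N(X_i|b_k)$. This is only one step of a Gibbs sampler and not the full algorithm of Chapter 7; the function names and array shapes are assumptions of this illustration, and each component is a normal distribution with identity covariance.

```python
import numpy as np

def normal_pdf(x, b):
    """N(x|b) with identity covariance; x: (n, M), b: (K, M) -> (n, K)."""
    M = x.shape[1]
    sq = np.sum((x[:, None, :] - b[None, :, :]) ** 2, axis=2)
    return np.exp(-0.5 * sq) / (2.0 * np.pi) ** (M / 2.0)

def mixture_pdf(x, a, b):
    """p(x|a,b) = sum_k a_k N(x|b_k), evaluated at each row of x."""
    return normal_pdf(x, b) @ a

def sample_labels(x, a, b, rng):
    """Draw competitive Y_i with P(y_k = 1) proportional to a_k N(X_i|b_k)."""
    joint = normal_pdf(x, b) * a[None, :]
    resp = joint / joint.sum(axis=1, keepdims=True)
    ks = np.array([rng.choice(len(a), p=p) for p in resp])
    return np.eye(len(a), dtype=int)[ks]   # one-hot rows
```

Each returned row contains exactly one 1, so summing the joint density over all competitive y recovers `mixture_pdf`.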


Example 9. Let us study a case M = 2, n = 100, where a true distribution
was set as

$$q(x) = \frac{1}{3} N(x|(-2,-2)) + \frac{1}{3} N(x|(0,0)) + \frac{1}{3} N(x|(2,2)). \qquad (2.30)$$

The hyperparameters of the prior distribution were $\alpha = 0.5$ and
$\sigma = 10$. Fifty independent trials were collected and the generalization
and cross validation losses were observed. The candidate statistical models
were K = 1, 2, 3, 4, 5. The posterior distribution was numerically
approximated by the Gibbs sampler. We calculated $G - S$, $CV - S_n$,
$AIC_b - S_n$, $DIC - S_n$, and $WAIC - S_n$. Figure 2.8 shows their averages
and standard deviations. The generalization loss does not increase much even
if the statistical model is redundant for the true distribution. In fact the
generalization loss in Figure 2.8 does not increase as AIC does. This is the
advantage of Bayesian estimation. However, the increases of the cross
validation loss and WAIC are too small compared to random fluctuations. In
practical applications, if the increases of the cross validation loss and
WAIC are far smaller than d/n (d is the number of parameters and n is the
sample size) even if the model becomes more complex, then the minimal model
in the set of the models which give almost the same cross validation loss and
WAIC should be selected.

Figure 2.8: Model comparison of normal mixture. Model comparison of a
normal mixture is studied. The true density corresponds to K = 3. The
averages of ISCV and WAIC errors were equal to that of the generalization
error, whereas those of DIC and AIC were not.

2.6 Nonparametric Mixture


A nonparametric normal mixture is defined by

$$p(x|a,b) = \lim_{K\to\infty} \Big\{ \sum_{k=1}^{K} a_k\, N(x|b_k) \Big\},$$

where $a = (a_1, a_2, \ldots)$ and $b = (b_1, b_2, \ldots)$ are infinite
dimensional parameters, which satisfy $\sum_k a_k = 1$ and $a_k \ge 0$. The
prior distributions are set by

$$\varphi(a|\alpha) = \lim_{K\to\infty} \Big\{ \frac{1}{z_1(\alpha)} \prod_{k=1}^{K} (a_k)^{\alpha/K - 1} \Big\},$$

$$\varphi(b|\sigma) = \lim_{K\to\infty} \Big\{ \frac{1}{z_2(\sigma)} \prod_{k=1}^{K} \exp\Big(-\frac{1}{2\sigma^2}\|b_k\|^2\Big) \Big\},$$

where $z_1(\alpha)$ and $z_2(\sigma)$ are normalizing constants. From the
mathematical point of view, this model is defined by using Dirichlet process
theory [22, 23], by which it is shown that p(x|a, b) is given by the discrete
summation with probability one. See Section 10.6.

Although the parameter belongs to an infinite dimensional space, we can
construct a Markov chain Monte Carlo method by manipulating essentially
finite dimensional parameters, so that the posterior and predictive
distributions are numerically approximated [21][32]. For example, the Chinese
restaurant process [47] and the stick-breaking process [39] have been
proposed. In [39], it is also proved that nonparametric Bayesian estimation
can be accurately approximated by a finite mixture model.

Let $y = (y_1, y_2, \ldots, y_k, \ldots)$ be an infinite dimensional
competitive variable: $y_k = 1$ for only one k and the others are zero. A
statistical model on (x, y) is defined by

$$p(x,y|a,b) = \lim_{K\to\infty} \Big\{ \prod_{k=1}^{K} \big(a_k\, N(x|b_k)\big)^{y_k} \Big\}. \qquad (2.31)$$

Then

$$p(x|a,b) = \sum_{y} p(x,y|a,b).$$

By using the hidden or latent variable $Y^n = \{Y_i\}$, the posterior
distribution of $(a, b, Y^n)$ is given by

$$p(a,b,Y^n|X^n) \propto \varphi(a)\,\varphi(b) \prod_{i=1}^{n} p(X_i,Y_i|a,b).$$

Figure 2.9: Nonparametric and finite mixture. (1) The true distribution
cannot be represented by any finite mixture. (2) A sample from the true
distribution, n = 100. (3) An estimated result by nonparametric Bayes with
$\alpha = 1$. The generalization error was 0.893. (4) An estimated result by
a finite mixture. The generalization error was 0.854. Even if a true
distribution is represented by a nonparametric model, the nonparametric
method is not always appropriate for statistical estimation.

Figure 2.10: Hyperparameter optimization in nonparametric Bayes. (1) The
horizontal and vertical axes show the log hyperparameter and the average
number of the components. If the hyperparameter increases, then the number
of components becomes larger. (2) Generalization error, WAIC error,
and ISCV error are compared with respect to the hyperparameter $\log\alpha$.


By using the Gibbs sampler, which is explained in Chapter 7, we obtain the
posterior samples $\{a_t, b_t, Y_t^n\}$, which consist of finite dimensional
parameters, so that the posterior distribution is numerically approximated.
Hence we obtain the posterior and predictive densities, so that information
criteria can be calculated. One might think that neither model selection nor
hyperparameter optimization is necessary in the nonparametric method because
they should be automatically estimated. However, such a consideration is
wrong. In general, estimation of something needs its prior. In other words,
estimation of a model and a hyperparameter recursively requires new priors on
them. Thus we need an evaluation procedure to prevent the infinite regress of
preparing priors.


Example 10. A true distribution on $x \in \mathbb{R}^2$ was set as

$$q(x) = \lim_{K_0\to\infty} \Big\{ \sum_{k=1}^{K_0} a_{0k}\, N(x|b_{0k}) \Big\},$$

where $N(x|b_{0k})$ is the normal distribution on $\mathbb{R}^2$ and
$b_{0k} = (b_{01k}, b_{02k})$ is defined by

$$a_{0k} = 1/K_0,$$
$$b_{01k} = 3\cos(2\pi k/K_0),$$
$$b_{02k} = 3\sin(2\pi k/K_0).$$

The density function of q(x) is shown in Figure 2.9 (1). Note that the
true density is not realizable by any finite mixture. In Figure 2.9 (2), a
sample with n = 100 independently taken from q(x) is shown. Figure 2.9 (3)
shows an estimated result by the nonparametric method with $\alpha = 1$. The
generalization error was 0.0893. In Figure 2.9 (4), an estimated result by a
finite mixture with $\alpha = 10$ and K = 10 is shown.
The generalization error was 0.0824.

Figure 2.11: Fluctuations of Bayesian observables. Fluctuations of
observables in nonparametric Bayesian estimation are shown for 100
independent samples. The panels are (1) cross validation error, (2) WAIC
error, and (3) generalization error. The horizontal axes are $\log\alpha$.
The lines connect the results of the same sample. The optimal $\alpha$ could
be found by the cross validation and WAIC. If $\alpha$ was made smaller, then
the standard deviation of the generalization error became larger.
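
A sample from a finite-$K_0$ version of this true distribution can be generated as in the following sketch. The finite truncation and the function name `sample_ring_mixture` are assumptions of this illustration, since the text defines q(x) as the limit $K_0 \to \infty$: a component k is chosen uniformly, its center is placed on the circle of radius 3, and standard normal noise is added.

```python
import numpy as np

def sample_ring_mixture(n, K0, rng):
    """Draw n points from sum_k (1/K0) N(x | 3(cos(2 pi k/K0), sin(2 pi k/K0)))."""
    k = rng.integers(1, K0 + 1, size=n)            # uniform component index 1..K0
    angle = 2.0 * np.pi * k / K0
    centers = 3.0 * np.stack([np.cos(angle), np.sin(angle)], axis=1)
    return centers + rng.normal(size=(n, 2))       # identity covariance noise
```

For large `K0` the sample concentrates around a ring of radius 3, which is why no mixture with a finite number of components realizes the limiting density.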

If we employ the nonparametric Bayes method, the hyperparameter $\alpha$
should be controlled appropriately, because it strongly affects the estimated
result. If $\alpha$ is close to zero, then the average number of components
becomes too small; otherwise, it becomes too large. In Figure 2.10 (1), the
average number of the components for a given $\alpha$ is shown. One might
think that the optimal model selection could be done by the nonparametric
method, but such a consideration is wrong, because the model selection
problem is merely replaced by the hyperparameter optimization. Also one might
think that the hyperparameter $\alpha$ could be optimized by its posterior
distribution by preparing its prior, but such a consideration is also wrong,
because the problem is then replaced by the optimal setting of the hyperprior
of $\alpha$. In other words, the nonparametric Bayes method does not realize
automatic control. On the other hand, even in nonparametric Bayes cases, the
optimal hyperparameter can be found by the generalization loss. In Figure
2.10 (2), the averages and standard deviations of the generalization, cross
validation, and WAIC errors are compared for 100 independent trials with
n = 100. Their fluctuations are shown in Figure 2.11. In this case, the
optimal hyperparameter was $\alpha = \exp(2)$, because the true distribution
q(x) cannot be realized by any finite mixture. If the true distribution can
be realized by some finite mixture, then the optimal $\alpha$ would be
smaller.


2.7 Problems 


1. The predictive density of the normal distribution eq.(2.1) is given by 
eq. (2.10), 


- 1 
Poole): ae (CON ENON Ce (2.32) 
where 
Ci = (¢1¢3 — $3)(¢3 + 1)/43, (2.33) 
C2 = $2/d3, (2.34) 
C3 = (d3+1)/2. (2.35) 


Prove that the average and variance of this predictive distribution are given 
by 


fe pla|X")\de = Co, (2.36) 
/ Cy alee = ae ae (2.37) 
by using a formula 
ia de Co Ie. ray 
=o (@= CoP Fae ~ I'(C3) 
Let us define two estimators of the variance,


Viayes — DCs — 9/2)" 
pee i 2 
Vink = Bae) 


Prove that, if 6143 > ¢%, then Viayes > Vmi holds. 


2. In the prior of the multinomial distribution eq.(2.13), let us assume that 
ee , 4; = A, where A > 0 is a constant. Then the cross validation loss C;, 
in eq.(2.17) as a function of the hyperparameter a = {a;} is minimized if 
and only if 


(n;/n) 
J 
2 (sl) Tran NFA) 


(2.38) 


is minimized. Also prove that C,, is minimized if and only if 


by using the fact that eq.(2.38) is Kullback-Leibler divergence between two 
probability distributions. On the other hand, prove that the generalization 
error using eq.(2.19) and the true entropy, 


N 
W045 
Gn, — 8 = S~ wo; log ———7—___ (2.39) 
ZF (nj +4j)/(n + A) 


are minimized if and only if aj = woj(n + A) — nj. Note that @; and aj are 
the hyperparameters that minimize the cross validation and generalization 


losses, respectively. The standard deviations of @; and aj are in proportion to 
1/\/n and \/n, respectively, resulting that 4; > 1+(A—N)wo; in probability, 
whereas, a; does not converge even if n — oo. Note that the random variable 
G,, — S is different from the average generalization error E[G,,] — S. 


3. In the linear regression problem, two approximation methods of eq.(2.25)
for a given function f(a, s) are defined by

$$E_1[f] = \frac{1}{T}\sum_{t=1}^{T} f(a_t, s_t),$$

$$E_2[f] = \frac{1}{T}\sum_{t=1}^{T} \int f(a, s_t)\, p(a|s_t, X^n, Y^n)\, da,$$

where $\{a_t\}$ and $\{s_t\}$ are independently taken from eq.(2.23) and
eq.(2.24) respectively. Prove that $E_1[f]$ and $E_2[f]$ have the same
average and that the variance of $E_1[f]$ is not smaller than that of
$E_2[f]$. In other words, the partial posterior integration makes the
variance smaller.
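
The claim of Problem 3 can be explored numerically before attempting the proof. The sketch below is not a proof and makes several assumptions of its own: the constants A, B, C and the gamma shape are passed in as hypothetical values, and the test function is fixed to $f(a,s) = a^2$, for which the inner integral over a has the closed form $B^2 + 1/(sA)$.

```python
import numpy as np

def compare_estimators(A, B, C, shape, T, reps, rng):
    """Empirical variances and means of E_1[f] and E_2[f] for f(a, s) = a^2.

    s is drawn from a gamma density with the given shape and rate C/2,
    and a given s from N(B, 1/(s A)); all inputs are illustrative values.
    """
    e1 = np.empty(reps)
    e2 = np.empty(reps)
    for j in range(reps):
        s = rng.gamma(shape, 2.0 / C, size=T)
        a = rng.normal(B, 1.0 / np.sqrt(s * A))
        e1[j] = np.mean(a ** 2)                    # plain Monte Carlo average
        e2[j] = np.mean(B ** 2 + 1.0 / (s * A))    # a integrated out analytically
    return e1.var(), e2.var(), e1.mean(), e2.mean()
```

In repeated runs the two estimators agree in mean, while the second has a visibly smaller variance, which is what the problem asks to establish in general.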


4. Let f(x, w) be a function of a neural network given in eq.(2.26). Then
the function $w \mapsto f(x,w)$ is not one-to-one, showing that the posterior
distribution does not concentrate on any local parameter region. Therefore,
even if the posterior distribution is precisely obtained, the average
parameter $E_w[w]$ is not an appropriate estimator. Discuss the reason why
DIC cannot be applied to such statistical models.


5. Let us study model selection problems of a normal mixture given by
eq.(2.28). Let K(n) be the optimal number of components in the set
$\{1, 2, \ldots, \infty\}$ that minimizes $E[G_n]$ for a given sample size n.
Discuss the behavior of K(n) if the true distribution is one of the following
densities.


ey = 5N(2l(02,0.2)) + N(al(04, 0.4), 
wl) = TN(el(-1,-1) + _yNClLD). 
er = aN (1|(0.2k, 0.24)). 

k=1 


The probability density functions $q_1(x)$ and $q_2(x)$ consist of two normal
distributions. However, they are almost the same as one normal distribution,
hence K(n) = 1 for not so large n. If n is sufficiently large, then K(n) = 2.
In the case $q_3(x)$, the true distribution is not contained in any finite
mixture of normal distributions. In this case K(n) slowly becomes larger as n
increases.

Chapter 3


Basic Formula of Bayesian 
Observables 


In this chapter, we introduce the basic Bayesian theory. For an arbitrary 
triple of a true distribution, a statistical model, and a prior, the behaviors of 
the free energy or the minus log marginal likelihood, the generalization loss, 
cross validation loss, training loss, and WAIC are derived by the following 
procedure. 

(1) Firstly, we define the formal relation between a true distribution and a 
statistical model. 

(2) Secondly, definitions of Bayesian observables and their normalized ones 
are introduced. 

(3) Thirdly, the cumulant generating function of the Bayesian prediction is 
defined. 

(4) And lastly, the basic theory of Bayesian statistics is proved by using the 
cumulant generating function. 

At the end of this chapter, we show the recipe for the Bayesian theory
construction and its application. In this chapter, we assume that a sample is
taken from an unknown true distribution and that a statistical model and a
prior are arbitrarily fixed.


3.1 Formal Relation between True and Model 


In this section, we define several formal relations between a true probability
density q(x) and a statistical model p(x|w).

Definition 4. (Realizability) Let $W \subset \mathbb{R}^d$ be a set of all
parameters. If there exists $w_0 \in W$ such that $q(x) = p(x|w_0)$, then q(x)
is said to be realizable by a statistical model p(x|w). Otherwise, q(x) is
said to be unrealizable. For a given pair q(x) and p(x|w), the set of true
parameters is defined by

$$W_{00} = \{w \in W ;\ q(x) = p(x|w) \text{ for arbitrary } x \text{ s.t. } q(x) > 0\}.$$

By the definition, q(x) is realizable by p(x|w) if and only if $W_{00}$ is not
the empty set. The set $W_{00}$ is equal to the set of zeros of the
Kullback-Leibler distance,

$$W_{00} = \Big\{w \in W ;\ \int q(x) \log\frac{q(x)}{p(x|w)}\, dx = 0\Big\}.$$

If $W_{00}$ is not the empty set, then for an arbitrary $w_0 \in W_{00}$,
$p(x|w_0)$ represents the same probability density function q(x). However,
derived functions such as

$$\Big(\frac{\partial}{\partial w}\Big) \log p(x|w)\Big|_{w=w_0}$$

may depend on the parameter $w_0$ in $W_{00}$.


For a true probability density function q(x) and a statistical model
p(x|w), the average log loss function is defined by

$$L(w) = -\int q(x) \log p(x|w)\, dx. \qquad (3.1)$$

It follows that

$$
\begin{aligned}
L(w) &= -\int q(x) \log q(x)\, dx + \int q(x) \log\frac{q(x)}{p(x|w)}\, dx \\
     &= S + K(q(x)\|p(x|w)),
\end{aligned}
$$

where S is the entropy of the true distribution and $K(q(x)\|p(x|w))$ is the
Kullback-Leibler distance from q(x) to p(x|w). If q(x) is realizable by a
statistical model, then the average log loss function is minimized if and only
if $w \in W_{00}$, and its minimum value is equal to the entropy of the true
distribution.


Definition 5. (Regularity) For a given pair of q(x) and p(x|w), let $W_0$ be
the set of minimum points of the average log loss function L(w),

$$W_0 = \{w \in W ;\ L(w) = \min_{w'} L(w')\},$$

which is called the set of optimal parameters for the minimum average log
loss function. If $W_0$ consists of a single element $w_0$ and there exists an
open set U such that $w_0 \in U \subset W$ and if the Hessian matrix
$\nabla^2 L(w_0)$ at $w_0$ defined by

$$(\nabla^2 L(w_0))_{ij} = \Big(\frac{\partial^2 L}{\partial w_i\, \partial w_j}\Big)(w_0) \qquad (3.2)$$

is positive definite, then q(x) is said to be regular for p(x|w).


By the definition, $W_0$ is equal to the set of minimum points of the
Kullback-Leibler distance $K(q(x)\|p(x|w))$. If W is a compact set and if
L(w) is a continuous function, then L(w) has a minimum point, hence $W_0$
is not the empty set. A true probability density q(x) is realizable by a
statistical model p(x|w) if and only if

$$W_{00} = W_0.$$

In general, $W_0$ may contain multiple elements. If a true density is
unrealizable by p(x|w), then there may exist $w_1, w_2 \in W_0$ which satisfy
$p(x|w_1) \neq p(x|w_2)$.

Definition 6. (Essential uniqueness) Assume that $W_0$ is not the empty set.
If there exists a unique probability density function $p_0(x)$ such that,

$$\text{for arbitrary } w_0 \in W_0,\quad p(x|w_0) = p_0(x),$$

then it is said that the optimal probability density function is essentially
unique.

If q(x) is realizable by p(x|w), then the optimal probability density
function is essentially unique, because $p_0(x) = q(x)$. If $W_0$ consists of
a single element, then the optimal probability density function is unique and
essentially unique.


Let us study several cases using examples from the viewpoints of realiz- 
ability, regularity, and uniqueness. 


Example 11. (Realizable, regular, and unique) A true probability density
function q(x) and a statistical model p(x|a) are defined by

$$q(x) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2}x^2\Big),$$

$$p(x|a) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2}(x - a)^2\Big).$$

In this case, the Kullback-Leibler distance is given by

$$K(a) = \frac{1}{2}a^2.$$

If $W = \{a ;\ -1 < a < 1\}$, then

$$W_{00} = W_0 = \{a ;\ a = 0\}$$

and the Hessian matrix of the average log loss function is

$$\nabla^2 L(a)|_{a=0} = \nabla^2 K(a)|_{a=0} = 1.$$

Hence q(x) is realizable by and regular for a statistical model p(x|a). If
$W = \{a ;\ 1 \le a \le 2\}$, then

$$W_{00} = \emptyset, \quad W_0 = \{a ;\ a = 1\}.$$

Hence q(x) is unrealizable by and nonregular for a statistical model p(x|a),
because there is no open set U with $1 \in U \subset W$.
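
The Kullback-Leibler distance of this example, $K(a) = a^2/2$ for a standard normal q(x) and a unit-variance normal model with mean a, can be checked by numerical integration, as in the following sketch. The grid width, the number of grid points, and the function name are assumptions of this illustration.

```python
import numpy as np

def kl_example11(a, half_width=12.0, num=200_001):
    """Riemann-sum approximation of the integral of q(x) log(q(x)/p(x|a))."""
    x = np.linspace(-half_width, half_width, num)
    dx = x[1] - x[0]
    log_q = -0.5 * x ** 2 - 0.5 * np.log(2.0 * np.pi)
    log_p = -0.5 * (x - a) ** 2 - 0.5 * np.log(2.0 * np.pi)
    return np.sum(np.exp(log_q) * (log_q - log_p)) * dx

# On the restricted parameter set 1 <= a <= 2 the minimum is on the boundary.
grid = np.linspace(1.0, 2.0, 11)
best = grid[np.argmin([kl_example11(a) for a in grid])]
```

On the restricted set the minimizer `best` is the boundary point a = 1, where the distance is still positive, which matches the unrealizable and nonregular case.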


Example 12. (Realizable, nonregular, and essentially unique) A true
probability density function q(y|x)q(x) and a statistical model
p(y|x, a, b)q(x) are defined by

$$q(x) = \begin{cases} 1 & (0 \le x \le 1) \\ 0 & (\text{otherwise}) \end{cases}$$

$$q(y|x) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2}y^2\Big),$$

$$p(y|x,a,b) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2}(y - a\sin(bx))^2\Big).$$

The set of parameters is defined by

$$W = \{(a,b) ;\ |a| \le 1,\ |b| \le \pi/2\}.$$

Then the Kullback-Leibler distance is given by

$$K(a,b) = \frac{a^2}{2} \int_0^1 \sin(bx)^2\, dx.$$

Therefore

$$W_{00} = W_0 = \{(a,b) \in W ;\ ab = 0\}.$$

The sets $W_{00}$ and $W_0$ consist of multiple elements. Hence q(y|x)q(x) is
realizable by and nonregular for p(y|x, a, b)q(x). In this case we also say
that the conditional density q(y|x) is realizable by and nonregular for
p(y|x, a, b).
Example 13. (Unrealizable, regular, and essentially unique) A true
probability density function q(y|x)q(x) and a statistical model p(y|x, a)q(x)
are defined by

$$q(x) = \begin{cases} 1 & (0 \le x \le 1) \\ 0 & (\text{otherwise}) \end{cases}$$

$$q(y|x) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2}(y - x^2)^2\Big),$$

$$p(y|x,a) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2}(y - ax)^2\Big).$$

The set of all parameters is defined by

$$W = \{a ;\ |a| \le 1\}.$$

Then the Kullback-Leibler distance is equal to

$$K(a) = \frac{1}{2}\int_0^1 (x^2 - ax)^2\, dx = \frac{1}{10} - \frac{a}{4} + \frac{a^2}{6}.$$

Therefore

$$W_{00} = \emptyset, \quad W_0 = \{a ;\ a = 3/4\},$$

and

$$\nabla^2 L(a)|_{a=3/4} = \nabla^2 K(a)|_{a=3/4} = 1/3.$$

Hence q(x)q(y|x) is unrealizable by and regular for q(x)p(y|x, a). Also
q(y|x) is unrealizable by and regular for p(y|x, a).


Example 14. (Unrealizable, nonregular, and essentially unique) A true
probability density function q(x, y) = q(x)q(y|x) and a statistical model
p(x, y|a, b) are defined by

$$q(x,y) = \frac{1}{2\pi} \exp\Big(-\frac{1}{2}\{x^2 + y^2\}\Big),$$

$$p(x,y|a,b) = \frac{1}{2\pi} \exp\Big(-\frac{1}{2}\{(x-1)^2 + (y - a\sin(bx))^2\}\Big),$$

where the set of all parameters is

$$W = \{(a,b) ;\ |a| \le 1,\ |b| \le \pi/2\}.$$

Then

$$K(a,b) = \frac{1}{2} + \frac{a^2}{2} \int_{-\infty}^{\infty} \sin(bx)^2\, q(x)\, dx.$$

Hence

$$W_{00} = \emptyset, \quad W_0 = \{(a,b) ;\ ab = 0\}.$$

It follows that q(x, y) is unrealizable by and nonregular for a statistical
model p(x, y|w); however, the optimal model is essentially unique.


Example 15. (Unrealizable, nonregular, and essentially nonunique) A true
probability density function q(x, y) and a statistical model
$p(x, y|\theta)$ are defined by

$$q(x,y) = \frac{1}{2\pi} \exp\Big(-\frac{1}{2}\{x^2 + y^2\}\Big),$$

$$p(x,y|\theta) = \frac{1}{2\pi} \exp\Big(-\frac{1}{2}\{(x - \cos\theta)^2 + (y - \sin\theta)^2\}\Big),$$

where the set of all parameters is

$$W = \{\theta ;\ -\pi < \theta < \pi\}.$$

Then $K(\theta)$ is a constant function of $\theta$,

$$K(\theta) = \frac{1}{2}.$$

Therefore

$$W_{00} = \emptyset, \quad W_0 = \{\theta ;\ -\pi < \theta < \pi\} = W.$$

For an arbitrary $\theta \in W_0$, the Kullback-Leibler distance
$K(q(x,y)\|p(x,y|\theta))$ is equal to the constant 1/2; however, if
$\theta_1 \neq \theta_2$ then $p(x,y|\theta_1) \neq p(x,y|\theta_2)$.
Therefore q(x, y) is unrealizable by and nonregular for a statistical model.
Moreover, the optimal density is essentially nonunique.


Definition 7. (Relatively finite variance of log density ratio function) Let
W and $W_0$ be sets of parameters and optimal parameters for the minimum
average log loss function, respectively. For a given pair $w_0 \in W_0$ and
$w \in W$, the log density ratio function is defined by

$$f(x, w_0, w) = \log\frac{p(x|w_0)}{p(x|w)}. \qquad (3.3)$$

If there exists $c_0 > 0$ such that, for an arbitrary pair $w_0 \in W_0$ and
$w \in W$,

$$E_X[f(X, w_0, w)] \ge c_0\, E_X[f(X, w_0, w)^2], \qquad (3.4)$$

then it is said that the log density ratio function $f(x, w_0, w)$ has a
relatively finite variance.

Remark 16. The function $f(x, w_0, w)$ has a relatively finite variance if and
only if

$$\sup_{w_0 \in W_0,\ w \notin W_0} \frac{E_X[f(X, w_0, w)^2]}{E_X[f(X, w_0, w)]} < \infty. \qquad (3.5)$$

If W and $W_0$ are compact sets and if $E_X[f(X, w_0, w)]$ and
$E_X[f(X, w_0, w)^2]$ are continuous functions, then both functions have
finite values. Hence the condition $w \notin W_0$ in the supremum of eq.(3.5)
can be replaced by the condition that w is contained in a neighborhood of the
set $E_X[f(X, w_0, w)] = 0$. In other words, the condition $w \notin W_0$ can
be replaced by

$$w \in \{w \notin W_0 ;\ E_X[f(X, w_0, w)] < \epsilon\},$$

for an arbitrarily small $\epsilon > 0$.


Example 16. Let us study the case given in Example 12. The log density ratio
function is

$$f(x,y,a,b) = -y\,a\sin(bx) + \frac{a^2 \sin^2(bx)}{2}.$$

Therefore

$$E_{(X,Y)}[f(x,y,a,b)] = \frac{a^2 b^2}{2} \int_0^1 S^2(bx)\, x^2\, dx,$$

$$E_{(X,Y)}[f(x,y,a,b)^2] = a^2 b^2 \int_0^1 S^2(bx)\, x^2\, dx + \frac{a^4 b^4}{4} \int_0^1 S^4(bx)\, x^4\, dx,$$

where S(x) = sin(x)/x with S(0) = 1. It follows that, in the neighborhood of
ab = 0, there exists $c_0 > 0$ such that
$E_{(X,Y)}[f(x,y,a,b)] \ge c_0\, E_{(X,Y)}[f(x,y,a,b)^2]$. Hence f(x, y, a, b)
has a relatively finite variance.
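
The inequality of this example can be checked numerically. For the model $y = a\sin(bx)$ plus standard normal noise with x uniform on [0, 1], the first moment of the log density ratio is $(a^2/2)\int_0^1 \sin^2(bx)dx$ and the second moment is $a^2\int_0^1 \sin^2(bx)dx + (a^4/4)\int_0^1 \sin^4(bx)dx$. The sketch below evaluates both by the trapezoidal rule; the function name and the grid size are assumptions of this illustration. Since $\sin^4 \le \sin^2$ and $|a| \le 1$, the ratio of the second to the first moment is at most 2.5, so the constant $c_0 = 0.4$ used in the comment is an observation of this sketch, not a value stated in the text.

```python
import numpy as np

def moments_example16(a, b, num=100_001):
    """Return (E[f], E[f^2]) for the log density ratio of this example."""
    x = np.linspace(0.0, 1.0, num)
    s2 = np.sin(b * x) ** 2

    def mean(v):
        # trapezoidal rule on [0, 1]
        return (np.sum(v) - 0.5 * (v[0] + v[-1])) / (num - 1)

    Ef = 0.5 * a ** 2 * mean(s2)
    Ef2 = a ** 2 * mean(s2) + 0.25 * a ** 4 * mean(s2 ** 2)
    return Ef, Ef2

# For |a| <= 1 and ab != 0 one expects E[f] >= 0.4 * E[f^2].
```

Near ab = 0 the ratio of the two moments tends to 2, so the bound does not degenerate as the parameter approaches the set of optimal parameters.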


Lemma 3. Assume that $w_0 \in W_0$ and $w \in W$. If $f(x, w_0, w)$ has a
relatively finite variance, then the optimal probability density is
essentially unique.


Proof. Assume that $w_1$ and $w_2$ are arbitrary elements of $W_0$. Then


$$ 0 = L(w_2) - L(w_1) = \int q(x) f(x, w_1, w_2)\, dx
     \ge c_0 \int q(x) f(x, w_1, w_2)^2\, dx. $$


Hence $f(x, w_1, w_2) = 0$ for an arbitrary $x$, so that
$p(x|w_1) - p(x|w_2) = 0$ as a function of $x$. □
By Lemma 3, if $f(x, w_0, w)$ has a relatively finite variance, then the
optimal probability density function is essentially unique and
$f(x, w_0, w)$ does not depend on $w_0$. In such a case, we use the simple
notation


$$ f(x, w) = f(x, w_0, w). $$


By the definition of $f(x, w)$, it follows that


$$ p(x|w) = p_0(x) \exp(-f(x, w)), $$


where $p_0(x) = p(x|w_0)$. Note that if the optimal probability density is
not essentially unique, then the log density ratio function does not have a
relatively finite variance. The following lemma shows that if a true density
is realizable by a statistical model and if the tail probability satisfies a
condition, then the log density ratio function has a relatively finite
variance.


Lemma 4. Assume that $W$ is a compact set, that $q(x)$ is realizable by
$p(x|w)$, and that the log density ratio function
$f(x, w) = \log(q(x)/p(x|w))$ is a continuous function of $(x, w)$. If there
exist $c_1, c_2 > 0$ such that, for an arbitrary $w \in W$,


$$ \int_{|x| \ge c_1} q(x) f(x, w)^2\, dx \le c_2 \int_{|x| < c_1} q(x) f(x, w)^2\, dx, $$


then $f(x, w)$ has a relatively finite variance.


Proof. Since $q(x)$ is realizable by $p(x|w)$, the optimal probability density
is uniquely equal to $q(x)$. A function $F(t)$ $(-\infty < t < \infty)$ is
defined by


$$ F(t) = t + e^{-t} - 1. $$


Then $F'(t) = 1 - e^{-t}$ and $F''(t) = e^{-t} > 0$, so that $F(t) \ge 0$, and
$F(t) = 0$ if and only if $t = 0$. The constants $c_3$ and $c_4$ are defined by


$$ c_3 = \sup_{w \in W} \sup_{|x| \le c_1} |f(x, w)|, \qquad
   c_4 = \inf_{|t| \le c_3} F(t)/t^2. $$


Then $c_3$ and $c_4$ are positive and finite values. Since


$$ q(x)\, F\!\left( \log \frac{q(x)}{p(x|w)} \right)
   = q(x) \log \frac{q(x)}{p(x|w)} + p(x|w) - q(x), $$

it follows that

$$ \int q(x) f(x, w)\, dx = \int q(x) F(f(x, w))\, dx
   \ge \int_{|x| < c_1} q(x) F(f(x, w))\, dx $$

$$ \ge c_4 \int_{|x| < c_1} q(x) f(x, w)^2\, dx
   \ge \frac{c_4}{1 + c_2} \int q(x) f(x, w)^2\, dx, $$


which completes the lemma. □
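The two properties of $F$ used in the proof, nonnegativity and a strictly positive value of $\inf F(t)/t^2$ on a bounded interval, can be inspected numerically. This is a small sketch rather than part of the proof; the grid size and the function name `c4` are arbitrary choices made here.

```python
import numpy as np

def F(t):
    # F(t) = t + exp(-t) - 1, written with expm1 for accuracy near t = 0.
    return np.expm1(-t) + t

def c4(c3, m=2000):
    # Grid approximation of inf_{|t| <= c3} F(t)/t^2.
    # An even number of points keeps t = 0 off the grid.
    t = np.linspace(-c3, c3, m)
    return float(np.min(F(t) / t ** 2))

print(c4(2.0), F(2.0) / 4.0)
```

Since $F(t)/t^2$ is decreasing in $t$, the infimum over $|t| \le c_3$ is attained at $t = c_3$, which the printed values reflect.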


The following lemma shows that, if a true density is regular for a
statistical model, then the log density ratio function has a relatively
finite variance.


Lemma 5. Assume that $W$ is a compact set and that, for an arbitrary pair
$w_0 \in W_0$ and $w \in W$, the second derivatives of


$$ \int q(x) f(x, w_0, w)\, dx, \qquad \int q(x) f(x, w_0, w)^2\, dx $$


are continuous functions. If $q(x)$ is regular for $p(x|w)$, then the log
density ratio function $f(x, w_0, w)$ has a relatively finite variance.


Proof. Since $W$ is a compact set, both functions


$$ \int q(x) f(x, w_0, w)\, dx, \qquad \int q(x) f(x, w_0, w)^2\, dx $$


have nonnegative values. By the definition,


$$ L(w) - L(w_0) = \int q(x) f(x, w_0, w)\, dx. $$

By the assumption that $q(x)$ is regular for $p(x|w)$, $L(w) - L(w_0) = 0$ if
and only if $w = w_0$. Hence it is sufficient to prove that there exists
$\epsilon > 0$ such that, in the region $\|w - w_0\| < \epsilon$,


$$ \int q(x) f(x, w_0, w)\, dx \ge c_1 \int q(x) f(x, w_0, w)^2\, dx $$


for some $c_1 > 0$. By the regularity condition and the mean value theorem,
there exists $w^*$ such that


$$ L(w) - L(w_0) = \frac{1}{2} (w - w_0) \cdot \nabla^2 L(w^*) (w - w_0) $$

in a neighborhood of $w = w_0$, hence there exists $\mu_1 > 0$ such that


$$ \int q(x) f(x, w_0, w)\, dx \ge \mu_1 \|w - w_0\|^2 $$


in a neighborhood of $w_0$. On the other hand,


$$ \int q(x) f(x, w_0, w)^2\, dx $$


is a nonnegative function and is equal to zero at $w = w_0$. Therefore, there
exists $\mu_2 > 0$ such that


$$ \int q(x) f(x, w_0, w)^2\, dx \le \mu_2 \|w - w_0\|^2 $$

in a neighborhood of $w_0$. It follows that


$$ \sup_{w \notin W_0} \frac{E_X[f(X, w_0, w)^2]}{E_X[f(X, w_0, w)]} < \infty, $$


which completes the lemma. □


Summary. Assume that the set of all parameters $W$ is compact. Then the
above lemmas show the following relations,


    {Regular} ⊂ {Relatively Finite Variance},
    {Realizable} ⊂ {Relatively Finite Variance},


and

    {Relatively Finite Variance} ⊂ {Essentially Unique}.


In this book, we mainly study cases when the log density ratio functions have
relatively finite variances. It should be emphasized that such cases include
nonregular cases, hence the conventional statistical asymptotic theory does
not hold in general.


Example 17. The foregoing examples are classified into the following cases.
• Example 11: Realizable and regular ⇒ Relatively finite variance.
• Example 12: Realizable ⇒ Relatively finite variance.
• Example 13: Regular ⇒ Relatively finite variance.
• Example 14: Nonregular, nonrealizable, but relatively finite variance.
• Example 15: Essentially nonunique ⇒ Relatively infinite variance.


In this book, we show that if a log density ratio function has a relatively
finite variance, the free energy or the minus log marginal likelihood $F_n$,
the generalization loss $G_n$, the cross validation loss $C_n$, the training
loss $T_n$, and WAIC $W_n$ are subject to the universal statistical laws.
That is to say, there exist constants $\lambda, \nu, m > 0$ such that


$$ F_n = n L_n(w_0) + \lambda \log n + (m - 1)\, O_p(\log \log n) + O_p(1), $$

$$ E[G_n] = L(w_0) + \lambda/n + o(1/n), $$
$$ E[C_n] = L(w_0) + \lambda/n + o(1/n), $$
$$ E[W_n] = L(w_0) + \lambda/n + o(1/n), $$
$$ E[T_n] = L(w_0) + (\lambda - 2\nu)/n + o(1/n). $$


Moreover, by defining

$$ L_n(w_0) = -\frac{1}{n} \sum_{i=1}^{n} \log p(X_i|w_0), $$


the behaviors of random variables satisfy


$$ G_n - L(w_0) + C_n - L_n(w_0) = 2\lambda/n + o_p(1/n), $$
$$ G_n - L(w_0) + W_n - L_n(w_0) = 2\lambda/n + o_p(1/n). $$


Note that these mathematical laws hold even if the posterior distribution
is quite different from any normal distribution. However, if the log density
ratio function does not have a relatively finite variance, then such
statistical laws do not hold in general.


3.2 Normalized Observables 


In this section, we introduce normalized observables. A triple of a true
probability density, a statistical model, and a prior,
$(q(x), p(x|w), \varphi(w))$, is fixed.

Let $X^n = (X_1, X_2, \ldots, X_n)$ be a set of random variables which are
independently subject to a true probability density function $q(x)$. Also let
$X$ be a random variable which is subject to the same density $q(x)$. Assume
that $X$ and $X^n$ are independent of each other. The average and empirical
log loss functions are


$$ L(w) = -E_X[\log p(X|w)], \tag{3.6} $$
$$ L_n(w) = -\frac{1}{n} \sum_{i=1}^{n} \log p(X_i|w). \tag{3.7} $$


Let $W_0$ be the set of optimal parameters for the minimum average log loss
function $L(w)$. That is to say, $W_0$ is the set of all parameters which make
$L(w)$ smallest. We assume that the log density ratio function of
$w_0 \in W_0$ and $w \in W$,

$$ f(x, w_0, w) = \log \frac{p(x|w_0)}{p(x|w)}, $$

has a relatively finite variance. By Lemma 3 the log density ratio function
$f(x, w_0, w)$ does not depend on $w_0$, hence we simply write
$f(x, w) = f(x, w_0, w)$. The normalized average and empirical log loss
functions are respectively defined by


$$ K(w) = E_X[f(X, w)], \tag{3.8} $$
$$ K_n(w) = \frac{1}{n} \sum_{i=1}^{n} f(X_i, w). \tag{3.9} $$


By the definition, $-\log p(x|w) = -\log p_0(x) + f(x, w)$, hence


$$ L(w) = L(w_0) + K(w), $$
$$ L_n(w) = L_n(w_0) + K_n(w), $$


and $K(w) \ge 0$. Moreover,

$$ K(w) = 0 \iff w \in W_0. $$


The normalized partition function or the normalized marginal likelihood is
defined by


$$ Z_n^{(0)} = \int \exp(-n K_n(w))\, \varphi(w)\, dw. $$

Then


$$ \prod_{i=1}^{n} p(X_i|w) = \left( \prod_{i=1}^{n} p_0(X_i) \right) \exp(-n K_n(w)) $$


and

$$ Z_n = \exp(-n L_n(w_0)) \cdot Z_n^{(0)}. $$
Since $L_n(w_0)$ is a constant function of $w$, the posterior distribution can
be rewritten as


$$ p(w|X^n) = \frac{1}{Z_n^{(0)}} \exp(-n K_n(w))\, \varphi(w). $$


The normalized free energy or the normalized minus log marginal likelihood
is defined by


$$ F_n^{(0)} = -\log \int \exp(-n K_n(w))\, \varphi(w)\, dw. $$


The normalized generalization, cross validation, and training losses and the
normalized WAIC are also defined by


$$ G_n^{(0)} = -E_X[\log E_w[\exp(-f(X, w))]], $$
$$ C_n^{(0)} = \frac{1}{n} \sum_{i=1}^{n} \log E_w[\exp(f(X_i, w))], $$
$$ T_n^{(0)} = -\frac{1}{n} \sum_{i=1}^{n} \log E_w[\exp(-f(X_i, w))], $$
$$ W_n^{(0)} = T_n^{(0)} + \frac{1}{n} \sum_{i=1}^{n} V_w[f(X_i, w)], $$


where $E_w[\ ]$ and $V_w[\ ]$ are the posterior average and variance,
respectively. Here $G_n^{(0)}$, $C_n^{(0)}$, $T_n^{(0)}$, and $W_n^{(0)}$ are
sometimes called the generalization, cross validation, training, and WAIC
errors, respectively.


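Lemma 6. The Bayesian observables and the normalized observables have the
relations


$$ F_n = n L_n(w_0) + F_n^{(0)}, $$
$$ G_n = L(w_0) + G_n^{(0)}, $$
$$ C_n = L_n(w_0) + C_n^{(0)}, $$
$$ T_n = L_n(w_0) + T_n^{(0)}, $$
$$ W_n = L_n(w_0) + W_n^{(0)}. $$


Proof. By the definition and


$$ -\log p(x|w) = -\log p_0(x) + f(x, w), $$


this lemma is derived. □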
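The first relation of Lemma 6 can be verified numerically on a one-dimensional model. The sketch below is an illustration only and assumes a normal location model $p(x|w) = N(w, 1)$, a uniform prior on $[-A, A]$, and $w_0 = 0$; none of these choices come from the lemma itself, and the helper names are hypothetical.

```python
import numpy as np

def integrate(y, grid):
    # Composite trapezoidal rule on a one-dimensional grid.
    return float(np.sum((y[1:] + y[:-1]) * np.diff(grid)) / 2.0)

def log_p(x, w):
    # log N(x | w, 1) for all pairs; result has shape (len(x), len(w)).
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x[:, None] - w[None, :]) ** 2

rng = np.random.default_rng(0)
n, sigma, A = 20, 1.5, 5.0
x = rng.normal(0.0, sigma, size=n)           # sample from the assumed true density
w = np.linspace(-A, A, 20001)                # parameter grid
prior = np.full_like(w, 1.0 / (2.0 * A))     # uniform prior on [-A, A]

L_n = -log_p(x, w).mean(axis=0)                    # empirical log loss L_n(w)
L_n_w0 = float(-log_p(x, np.array([0.0])).mean())  # L_n(w_0) with w_0 = 0
K_n = L_n - L_n_w0                                 # normalized empirical log loss

F_n = -np.log(integrate(np.exp(-n * L_n) * prior, w))    # free energy
F0_n = -np.log(integrate(np.exp(-n * K_n) * prior, w))   # normalized free energy
print(F_n, n * L_n_w0 + F0_n)
```

The two printed numbers agree up to rounding, reflecting $F_n = n L_n(w_0) + F_n^{(0)}$.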
Remark 17. The Bayesian observables are used in the analysis of practical
applications, whereas the normalized observables are useful in Bayesian theory
construction, because they are mathematically essential quantities.

In the following chapters, we will show that, if a log density ratio function
has a relatively finite variance, then $F_n^{(0)}/n$, $G_n^{(0)}$,
$C_n^{(0)}$, $T_n^{(0)}$, and $W_n^{(0)}$ converge to zero in probability,
therefore


$$ F_n/n = L_n(w_0) + O_p(\log n / n), $$
$$ G_n = L(w_0) + O_p(1/n), $$
$$ C_n = L_n(w_0) + O_p(1/n), $$
$$ T_n = L_n(w_0) + O_p(1/n), $$
$$ W_n = L_n(w_0) + O_p(1/n). $$


By the central limit theorem,

$$ L_n(w_0) - L(w_0) = O_p(1/\sqrt{n}). $$


Neither $L_n(w_0)$ nor $L(w_0)$ depends on a prior. $F_n/n$, $G_n$, $C_n$,
and $T_n$ converge to $L(w_0)$ when $n \to \infty$. If a true distribution is
realizable by a statistical model, then $p(x|w_0) = q(x)$ and


$$ F_n/n = S_n + O_p(\log n / n), $$
$$ G_n = S + O_p(1/n), $$
$$ C_n = S_n + O_p(1/n), $$
$$ T_n = S_n + O_p(1/n), $$
$$ W_n = S_n + O_p(1/n), $$


where $S$ and $S_n$ are the entropy and the empirical entropy of the true
distribution, respectively. Neither $S$ nor $S_n$ depends on a statistical
model or a prior. Thus the main purpose of the mathematical theory is to
clarify the random behaviors of the normalized observables.


3.3 Cumulant Generating Functions


In order to study the asymptotic behaviors of the generalization loss, the
cross validation loss, the training loss, and WAIC, the cumulant generating
functions are useful.

Definition 8. Let $\alpha$ be a real value. The cumulant generating functions
of generalization and training losses are respectively defined by


$$ \mathcal{G}_n(\alpha) = E_X[\log E_w[p(X|w)^\alpha]], \tag{3.10} $$
$$ \mathcal{T}_n(\alpha) = \frac{1}{n} \sum_{i=1}^{n} \log E_w[p(X_i|w)^\alpha]. \tag{3.11} $$


The $k$th cumulants are defined by


$$ \left( \frac{d}{d\alpha} \right)^k \mathcal{G}_n(0), \qquad
   \left( \frac{d}{d\alpha} \right)^k \mathcal{T}_n(0). $$


Remark 18. Since the definition of the posterior average $E_w[\ ]$ depends on
$X^n$, the average operations $E$ and $E_w$ do not commute,
$E E_w \neq E_w E$. Hence


$$ E[\mathcal{G}_n(\alpha)] \neq E[\mathcal{T}_n(\alpha)]. $$


By the definition,

$$ \mathcal{G}_n(0) = \mathcal{T}_n(0) = 0, $$


and the generalization, cross validation, and training losses and WAIC are
given by


$$ G_n = -\mathcal{G}_n(1), $$
$$ C_n = \mathcal{T}_n(-1), $$
$$ T_n = -\mathcal{T}_n(1), $$
$$ W_n = -\mathcal{T}_n(1) + \mathcal{T}_n''(0). $$


If we obtain the cumulant generating functions as functions of $\alpha$, then
it is easy to calculate the generalization loss, the cross validation loss,
the training loss, and WAIC. By using Taylor expansion, the cumulant
generating functions are reconstructed from the $k$th cumulants,


$$ \mathcal{G}_n(\alpha) = \alpha\, \mathcal{G}_n'(0) + \frac{\alpha^2}{2} \mathcal{G}_n''(0)
   + \frac{\alpha^3}{3!} \mathcal{G}_n^{(3)}(0) + \cdots, $$
$$ \mathcal{T}_n(\alpha) = \alpha\, \mathcal{T}_n'(0) + \frac{\alpha^2}{2} \mathcal{T}_n''(0)
   + \frac{\alpha^3}{3!} \mathcal{T}_n^{(3)}(0) + \cdots, $$

if these expansions converge absolutely. Even if these series do not converge
absolutely, the asymptotic expansions of $\mathcal{G}_n(\alpha)$ and
$\mathcal{T}_n(\alpha)$ can be derived in many cases by the higher order mean
value theorem.
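Definition 9. Let $\alpha$ be a real value. By using the log density ratio
function $f(x, w)$ which satisfies $p(x|w) = p_0(x) \exp(-f(x, w))$, the
normalized cumulant generating functions of generalization and training losses
are respectively defined by


$$ \mathcal{G}_n^{(0)}(\alpha) = E_X[\log E_w[\exp(-\alpha f(X, w))]], $$
$$ \mathcal{T}_n^{(0)}(\alpha) = \frac{1}{n} \sum_{i=1}^{n} \log E_w[\exp(-\alpha f(X_i, w))]. $$


From this definition, simple relations between the cumulant generating
functions and the normalized ones are derived. By using
$p(x|w) = p_0(x) \exp(-f(x, w))$, where $p_0(x)$ does not depend on $w$,


$$ \log E_w[p(X|w)^\alpha] = \alpha \log p_0(X) + \log E_w[\exp(-\alpha f(X, w))]. $$


Therefore,


$$ \mathcal{G}_n(\alpha) = -\alpha L(w_0) + \mathcal{G}_n^{(0)}(\alpha), \tag{3.12} $$
$$ \mathcal{T}_n(\alpha) = -\alpha L_n(w_0) + \mathcal{T}_n^{(0)}(\alpha). \tag{3.13} $$

Hence, for $k = 0, 2, 3, 4, \ldots$ $(k \neq 1)$ the $k$th cumulants satisfy


$$ \left( \frac{d}{d\alpha} \right)^k \mathcal{G}_n(0)
   = \left( \frac{d}{d\alpha} \right)^k \mathcal{G}_n^{(0)}(0), $$
$$ \left( \frac{d}{d\alpha} \right)^k \mathcal{T}_n(0)
   = \left( \frac{d}{d\alpha} \right)^k \mathcal{T}_n^{(0)}(0). $$


For $k = 1$,


$$ \left( \frac{d}{d\alpha} \right) \mathcal{G}_n(0)
   = -L(w_0) + \left( \frac{d}{d\alpha} \right) \mathcal{G}_n^{(0)}(0), $$
$$ \left( \frac{d}{d\alpha} \right) \mathcal{T}_n(0)
   = -L_n(w_0) + \left( \frac{d}{d\alpha} \right) \mathcal{T}_n^{(0)}(0). $$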

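Definition 10. Let $\alpha$ be a real value. For a given random variable $A$,
we use the notations


$$ \ell_k(A) = \frac{E_w[(\log p(A|w))^k\, p(A|w)^\alpha]}{E_w[p(A|w)^\alpha]}, $$
$$ \ell_k^{(0)}(A) = \frac{E_w[(-f(A, w))^k \exp(-\alpha f(A, w))]}{E_w[\exp(-\alpha f(A, w))]}. $$


If $\alpha = 0$, these are equal to the posterior averages of the $k$th power,


$$ \ell_k(A) = E_w[(\log p(A|w))^k], $$
$$ \ell_k^{(0)}(A) = E_w[(-f(A, w))^k]. $$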
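Lemma 7. The cumulant generating functions satisfy


$$ \mathcal{G}_n'(\alpha) = E_X[\ell_1(X)], $$
$$ \mathcal{G}_n''(\alpha) = E_X[\ell_2(X) - \ell_1(X)^2], $$
$$ \mathcal{T}_n'(\alpha) = \frac{1}{n} \sum_{i=1}^{n} \ell_1(X_i), $$
$$ \mathcal{T}_n''(\alpha) = \frac{1}{n} \sum_{i=1}^{n} \{ \ell_2(X_i) - \ell_1(X_i)^2 \}. $$


The same equations hold for the normalized functions,


$$ \mathcal{G}_n^{(0)\prime}(\alpha) = E_X[\ell_1^{(0)}(X)], $$
$$ \mathcal{G}_n^{(0)\prime\prime}(\alpha) = E_X[\ell_2^{(0)}(X) - \ell_1^{(0)}(X)^2], $$
$$ \mathcal{T}_n^{(0)\prime}(\alpha) = \frac{1}{n} \sum_{i=1}^{n} \ell_1^{(0)}(X_i), $$
$$ \mathcal{T}_n^{(0)\prime\prime}(\alpha) = \frac{1}{n} \sum_{i=1}^{n} \{ \ell_2^{(0)}(X_i) - \ell_1^{(0)}(X_i)^2 \}. $$


Proof. For an arbitrary function $g(\alpha)$, let $\partial^k g(\alpha)$ be
the $k$th order derived function of $g(\alpha)$. Then


$$ \partial \log g(\alpha) = \frac{\partial g(\alpha)}{g(\alpha)} $$

and

$$ \partial \left( \frac{\partial^k g(\alpha)}{g(\alpha)} \right)
   = \frac{\partial^{k+1} g(\alpha)}{g(\alpha)}
   - \left( \frac{\partial^k g(\alpha)}{g(\alpha)} \right)
     \left( \frac{\partial g(\alpha)}{g(\alpha)} \right). $$

By using this equation recursively, the lemma is obtained. □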


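Remark 19. By the same method, the higher order cumulants can be derived.
For example,


$$ \mathcal{G}_n^{(3)}(\alpha) = E_X[\ell_3(X) - 3\ell_2(X)\ell_1(X) + 2\ell_1(X)^3], $$
$$ \mathcal{G}_n^{(4)}(\alpha) = E_X[\ell_4(X) - 4\ell_3(X)\ell_1(X) - 3\ell_2(X)^2
   + 12\ell_2(X)\ell_1(X)^2 - 6\ell_1(X)^4], $$
$$ \mathcal{T}_n^{(3)}(\alpha) = \frac{1}{n} \sum_{i=1}^{n}
   \{ \ell_3(X_i) - 3\ell_2(X_i)\ell_1(X_i) + 2\ell_1(X_i)^3 \}, $$
$$ \mathcal{T}_n^{(4)}(\alpha) = \frac{1}{n} \sum_{i=1}^{n}
   \{ \ell_4(X_i) - 4\ell_3(X_i)\ell_1(X_i) - 3\ell_2(X_i)^2
   + 12\ell_2(X_i)\ell_1(X_i)^2 - 6\ell_1(X_i)^4 \}. $$


For the normalized case, the same equations hold by replacing $\ell_k(A)$ by
$\ell_k^{(0)}(A)$.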


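By using these equations, the cumulants are given as follows. For $k = 1$,


$$ \mathcal{G}_n'(0) = E_X[E_w[\log p(X|w)]], \tag{3.14} $$
$$ \mathcal{T}_n'(0) = \frac{1}{n} \sum_{i=1}^{n} E_w[\log p(X_i|w)], \tag{3.15} $$
$$ \mathcal{G}_n^{(0)\prime}(0) = -E_X[E_w[f(X, w)]] = -E_w[K(w)], $$
$$ \mathcal{T}_n^{(0)\prime}(0) = -\frac{1}{n} \sum_{i=1}^{n} E_w[f(X_i, w)] = -E_w[K_n(w)]. $$


For $k = 2$,


$$ \mathcal{G}_n''(0) = E_X[E_w[(\log p(X|w))^2] - E_w[\log p(X|w)]^2]
   = E_X[E_w[f(X, w)^2] - E_w[f(X, w)]^2]
   = E_X[V_w[f(X, w)]], \tag{3.16} $$
$$ \mathcal{T}_n''(0) = \frac{1}{n} \sum_{i=1}^{n} \{ E_w[(\log p(X_i|w))^2] - E_w[\log p(X_i|w)]^2 \}
   = \frac{1}{n} \sum_{i=1}^{n} \{ E_w[f(X_i, w)^2] - E_w[f(X_i, w)]^2 \}
   = \frac{1}{n} \sum_{i=1}^{n} V_w[f(X_i, w)], \tag{3.17} $$


where $V_w[f(w)]$ is the variance of $f(w)$ in the posterior distribution.


Lemma 8. Let $c_2 = 2$, $c_3 = 6$, $c_4 = 26$. Then for $k = 2, 3, 4$,


$$ \left| \left( \frac{d}{d\alpha} \right)^k \mathcal{G}_n^{(0)}(\alpha) \right|
   \le c_k\, E_X\!\left[ \frac{E_w[|f(X, w)|^k \exp(-\alpha f(X, w))]}{E_w[\exp(-\alpha f(X, w))]} \right], $$
$$ \left| \left( \frac{d}{d\alpha} \right)^k \mathcal{T}_n^{(0)}(\alpha) \right|
   \le c_k\, \frac{1}{n} \sum_{i=1}^{n}
   \frac{E_w[|f(X_i, w)|^k \exp(-\alpha f(X_i, w))]}{E_w[\exp(-\alpha f(X_i, w))]}. $$


Proof. For an arbitrary random variable $A$ and an arbitrary function
$g(A, w)$, $E_w^{(\alpha)}[\ ]$ is an expectation operator defined by


$$ E_w^{(\alpha)}[g(A, w)] = \frac{E_w[g(A, w) \exp(-\alpha f(A, w))]}{E_w[\exp(-\alpha f(A, w))]}. $$

Let us prove the first inequality for $k = 3$.


$$ \left( \frac{d}{d\alpha} \right)^3 \mathcal{G}_n^{(0)}(\alpha)
   = -E_X\big[ E_w^{(\alpha)}[f(X, w)^3]
   - 3 E_w^{(\alpha)}[f(X, w)^2]\, E_w^{(\alpha)}[f(X, w)]
   + 2 E_w^{(\alpha)}[f(X, w)]^3 \big]. $$


Then by using Hölder's inequality, for arbitrary $1 \le j \le k$,


$$ E_w^{(\alpha)}[|f(A, w)|^j] \le E_w^{(\alpha)}[|f(A, w)|^k]^{j/k}. $$


By applying this inequality, it follows that


$$ \left| \left( \frac{d}{d\alpha} \right)^3 \mathcal{G}_n^{(0)}(\alpha) \right|
   \le 6\, E_X[E_w^{(\alpha)}[|f(X, w)|^3]]. $$


For the other cases, the same method can be applied, which completes the
lemma. □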
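The moment-to-cumulant formulas of Lemma 7 and Remark 19, and the constants of Lemma 8, can be checked on a small discrete distribution. In this sketch the weights stand in for a posterior (or tilted posterior) and the values for $\log p(A|w)$; both are arbitrary numbers chosen for illustration, and the function name is hypothetical.

```python
import numpy as np

def cumulants_from_raw(m1, m2, m3, m4):
    # Second to fourth cumulants written with raw moments m_k = l_k.
    k2 = m2 - m1 ** 2
    k3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    k4 = m4 - 4 * m3 * m1 - 3 * m2 ** 2 + 12 * m2 * m1 ** 2 - 6 * m1 ** 4
    return k2, k3, k4

values = np.array([-1.0, 0.5, 2.0, 3.5])    # hypothetical values of log p(A|w)
weights = np.array([0.1, 0.4, 0.3, 0.2])    # hypothetical posterior weights
raw = [float(np.sum(weights * values ** k)) for k in (1, 2, 3, 4)]
k2, k3, k4 = cumulants_from_raw(*raw)

# Reference values from central moments: k2 = mu2, k3 = mu3, k4 = mu4 - 3 mu2^2.
mu = raw[0]
mu2, mu3, mu4 = (float(np.sum(weights * (values - mu) ** k)) for k in (2, 3, 4))

# c_k in Lemma 8 equals the sum of absolute coefficients of each formula.
c2, c3, c4 = 1 + 1, 1 + 3 + 2, 1 + 4 + 3 + 12 + 6
print(k2, k3, k4, c2, c3, c4)
```

The agreement with the central-moment expressions is a consistency check of the signs and coefficients, not a substitute for the proof.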


3.4 Basic Bayesian Theory 


By combining the foregoing observables, we obtain the basic theorem of 
Bayesian statistics. 


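Theorem 3. (Basic theorem) Let $\mathcal{G}_n'(0)$, $\mathcal{T}_n'(0)$,
$\mathcal{G}_n''(0)$, and $\mathcal{T}_n''(0)$ be random variables defined by
eq.(3.14), ..., eq.(3.17). Assume that


$$ \sup_{|\alpha| \le 1} \left| \left( \frac{d}{d\alpha} \right)^3 \mathcal{G}_n(\alpha) \right|
   = o_p\!\left( \frac{1}{n} \right), \tag{3.18} $$
$$ \sup_{|\alpha| \le 1} \left| \left( \frac{d}{d\alpha} \right)^3 \mathcal{T}_n(\alpha) \right|
   = o_p\!\left( \frac{1}{n} \right). \tag{3.19} $$


Then the generalization loss, the cross validation loss, the training loss, and
WAIC are given by


$$ G_n = -\mathcal{G}_n(1) = -\mathcal{G}_n'(0) - \frac{1}{2} \mathcal{G}_n''(0) + o_p\!\left( \frac{1}{n} \right), \tag{3.20} $$
$$ T_n = -\mathcal{T}_n(1) = -\mathcal{T}_n'(0) - \frac{1}{2} \mathcal{T}_n''(0) + o_p\!\left( \frac{1}{n} \right), \tag{3.21} $$
$$ C_n = \mathcal{T}_n(-1) = -\mathcal{T}_n'(0) + \frac{1}{2} \mathcal{T}_n''(0) + o_p\!\left( \frac{1}{n} \right), \tag{3.22} $$
$$ W_n = -\mathcal{T}_n(1) + \mathcal{T}_n''(0) = -\mathcal{T}_n'(0) + \frac{1}{2} \mathcal{T}_n''(0) + o_p\!\left( \frac{1}{n} \right). \tag{3.23} $$

Assume that


$$ E\!\left[ \sup_{|\alpha| \le 1} \left| \left( \frac{d}{d\alpha} \right)^3 \mathcal{G}_n(\alpha) \right| \right]
   = o\!\left( \frac{1}{n} \right), \tag{3.24} $$
$$ E\!\left[ \sup_{|\alpha| \le 1} \left| \left( \frac{d}{d\alpha} \right)^3 \mathcal{T}_n(\alpha) \right| \right]
   = o\!\left( \frac{1}{n} \right). \tag{3.25} $$


Then the averages of the generalization loss, the cross validation loss, the
training loss, and WAIC are given by


$$ E[G_n] = -E[\mathcal{G}_n'(0)] - \frac{1}{2} E[\mathcal{G}_n''(0)] + o\!\left( \frac{1}{n} \right), \tag{3.26} $$
$$ E[T_n] = -E[\mathcal{T}_n'(0)] - \frac{1}{2} E[\mathcal{T}_n''(0)] + o\!\left( \frac{1}{n} \right), \tag{3.27} $$
$$ E[C_n] = -E[\mathcal{T}_n'(0)] + \frac{1}{2} E[\mathcal{T}_n''(0)] + o\!\left( \frac{1}{n} \right), \tag{3.28} $$
$$ E[W_n] = -E[\mathcal{T}_n'(0)] + \frac{1}{2} E[\mathcal{T}_n''(0)] + o\!\left( \frac{1}{n} \right). \tag{3.29} $$


Proof. By using the mean value theorem, for a given $\alpha$ there exists
$\alpha^*$ such that $|\alpha^*| \le |\alpha|$ and that


$$ \mathcal{G}_n(\alpha) = \mathcal{G}_n(0) + \alpha\, \mathcal{G}_n'(0)
   + \frac{1}{2} \alpha^2 \mathcal{G}_n''(0)
   + \frac{1}{6} \alpha^3 \mathcal{G}_n^{(3)}(\alpha^*). $$


The case $\alpha = 1$ gives the first half of the theorem. The latter half is
derived by the same method for $\mathcal{T}_n(\alpha)$ and $\alpha = \pm 1$. □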


The conditions given by eqs.(3.18), (3.19), (3.24), and (3.25) are proved
in the following chapters for several circumstances. Moreover, we will prove
that if the log density ratio function has a relatively finite variance, there
exist constants $\lambda, \nu > 0$ such that


$$ \mathcal{G}_n^{(0)\prime}(0) + \mathcal{T}_n^{(0)\prime}(0) = -\frac{2\lambda}{n} + o_p\!\left( \frac{1}{n} \right), $$
$$ \mathcal{G}_n^{(0)\prime\prime}(0) = \frac{2\nu}{n} + o_p\!\left( \frac{1}{n} \right), $$
$$ \mathcal{T}_n^{(0)\prime\prime}(0) = \frac{2\nu}{n} + o_p\!\left( \frac{1}{n} \right), $$

and that their averages satisfy the equation

$$ E\!\left[ \mathcal{G}_n^{(0)\prime}(0) + \frac{\nu}{n} \right]
   = E\!\left[ \mathcal{T}_n^{(0)\prime}(0) - \frac{\nu}{n} \right] + o\!\left( \frac{1}{n} \right). $$

It is very important that these equations hold even if the posterior
distribution cannot be approximated by any normal distribution. These
equations are universal laws in Bayesian statistics. Also, these equations
except eq.(3.22) and eq.(3.28) hold even if a sample consists of conditionally
independent random variables. Note that if a sample consists of independent
random variables, then


$$ E[C_n] = E[G_{n-1}]. $$


However, this equation does not hold without independence. The cross
validation can be employed only in the case when a sample is independent.


Based on the above theorem, for a given triple $(q(x), p(x|w), \varphi(w))$,
Bayesian theory can be derived by the following recipe. The theoretical
behaviors of the free energy or the minus log marginal likelihood, the
generalization loss, the cross validation loss, the training loss, and WAIC
are clarified by the following procedures.


Recipe for Bayesian Theory Construction


1. An arbitrary triple of a true distribution, a statistical model, and a
   prior $(q(x), p(x|w), \varphi(w))$ is chosen and fixed. The set of all
   parameters is denoted by $W$. Assume that $X^n$ is a sample which consists
   of independent random variables subject to $q(x)$.


2. The empirical and average log loss functions are defined by

   $$ L_n(w) = -\frac{1}{n} \sum_{i=1}^{n} \log p(X_i|w), $$
   $$ L(w) = -\int q(x) \log p(x|w)\, dx. $$

   Find the set of optimal parameters which minimize $L(w)$,

   $$ W_0 = \{ w \in W \,;\, L(w) = \min_{w' \in W} L(w') \}. $$


3. Check that the log density ratio function made of $q(x)$ and $p(x|w)$ has
   a relatively finite variance. Then

   $$ f(x, w) = \log \frac{p(x|w_0)}{p(x|w)} $$

   does not depend on the choice of $w_0 \in W_0$.


4. Define the average and empirical log likelihood ratio functions

   $$ K(w) = \int f(x, w)\, q(x)\, dx, $$
   $$ K_n(w) = \frac{1}{n} \sum_{i=1}^{n} f(X_i, w). $$

   The normalized partition function or the normalized marginal likelihood
   is given by

   $$ Z_n^{(0)} = \int \exp(-n K_n(w))\, \varphi(w)\, dw. $$

   Then the free energy or the minus log marginal likelihood is given by

   $$ F_n = n L_n(w_0) - \log Z_n^{(0)}. $$


5. The average by the posterior distribution is equal to

   $$ E_w[\ \cdot\ ] = \frac{\int (\ \cdot\ ) \exp(-n K_n(w))\, \varphi(w)\, dw}{\int \exp(-n K_n(w))\, \varphi(w)\, dw}. $$

   Calculate $E_w[f(x, w)]$ and $V_w[f(x, w)]$. Then

   $$ E_w[K(w)] = E_X E_w[f(X, w)], $$
   $$ E_w[K_n(w)] = \frac{1}{n} \sum_{i=1}^{n} E_w[f(X_i, w)], $$
   $$ E_X V_w[f(X, w)] $$

   are obtained.


6. Based on the basic Theorem 3, the generalization, cross validation,
   and training losses are given by

   $$ G_n = L(w_0) + E_w[K(w)] - \frac{1}{2} E_X V_w[f(X, w)] + o_p\!\left( \frac{1}{n} \right), $$
   $$ C_n = L_n(w_0) + E_w[K_n(w)] + \frac{1}{2n} \sum_{i=1}^{n} V_w[f(X_i, w)] + o_p\!\left( \frac{1}{n} \right), $$
   $$ T_n = L_n(w_0) + E_w[K_n(w)] - \frac{1}{2n} \sum_{i=1}^{n} V_w[f(X_i, w)] + o_p\!\left( \frac{1}{n} \right). $$

   Note that WAIC, $W_n$, has the same expansion as $C_n$.
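In numerical work the quantities of step 6 are usually estimated from posterior draws. The following minimal sketch computes $T_n$, $C_n = \mathcal{T}_n(-1)$, and $W_n = T_n + \mathcal{T}_n''(0)$ from a matrix of log likelihood values. The function names are hypothetical, and the demonstration assumes the model $p(x|w) = N(w, 1)$ with a wide flat prior, so that the posterior is taken to be $N(\bar{x}, 1/n)$.

```python
import numpy as np

def log_mean_exp(a, axis=0):
    """Numerically stable log of the mean of exp(a) along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(a - m), axis=axis))

def training_loss(loglik):
    # T_n = -(1/n) sum_i log E_w[p(X_i|w)], with loglik[s, i] = log p(X_i | w_s).
    return float(-np.mean(log_mean_exp(loglik, axis=0)))

def cross_validation_loss(loglik):
    # C_n = (1/n) sum_i log E_w[1 / p(X_i|w)]  (importance sampling form).
    return float(np.mean(log_mean_exp(-loglik, axis=0)))

def waic(loglik):
    # W_n = T_n + (1/n) sum_i V_w[log p(X_i|w)].
    return training_loss(loglik) + float(np.mean(np.var(loglik, axis=0)))

rng = np.random.default_rng(0)
n, sigma, num_draws = 100, 1.0, 4000
x = rng.normal(0.0, sigma, size=n)
# Assumed posterior draws: N(mean(x), 1/n).
w = rng.normal(np.mean(x), 1.0 / np.sqrt(n), size=num_draws)
loglik = -0.5 * np.log(2 * np.pi) - 0.5 * (x[None, :] - w[:, None]) ** 2

T_n = training_loss(loglik)
C_n = cross_validation_loss(loglik)
W_n = waic(loglik)
print(T_n, C_n, W_n)
```

In this setting $W_n$ and $C_n$ are nearly equal and both exceed $T_n$, in line with the note that WAIC has the same expansion as the cross validation loss.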
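Example 18. Let us apply the recipe to a simple case.


1. Let $A > 0$ be a constant and $W = \{ w \in \mathbb{R} \,;\, |w| \le A \}$.
   Let us derive the Bayesian statistical theory for the case when a triple
   $(q(x), p(x|w), \varphi(w))$ is given by

   $$ q(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{x^2}{2\sigma^2} \right), $$
   $$ p(x|w) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{(x - w)^2}{2} \right), $$
   $$ \varphi(w) = \frac{1}{2A}, $$

   where $\sigma^2 > 0$ is not a parameter but a constant. Note that this true
   distribution is realizable by the statistical model if and only if
   $\sigma^2 = 1$.


2. The empirical and average log loss functions are respectively given by

   $$ L_n(w) = \frac{1}{2} \log(2\pi) + \frac{1}{2} \left( \sigma_n^2 - \frac{2}{\sqrt{n}} \xi_n w + w^2 \right), $$
   $$ L(w) = \frac{1}{2} \log(2\pi) + \frac{1}{2} (\sigma^2 + w^2), $$

   where

   $$ \sigma_n^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2, \qquad
      \xi_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i. $$

   The random variable $\sigma_n^2$ converges to $\sigma^2$ in probability.
   The random variable $\xi_n$ is subject to the normal distribution whose
   average and variance are zero and $\sigma^2$, respectively. The average log
   loss function $L(w)$ is minimized if and only if $w = 0$, so that
   $W_0 = \{0\}$.


3. The log density ratio function is

   $$ f(x, w) = \log \frac{p(x|w_0)}{p(x|w)} = \frac{w^2}{2} - wx, $$

   where we used $w_0 = 0$. It follows that

   $$ E_X[f(X, w)] = \frac{w^2}{2}, $$
   $$ E_X[f(X, w)^2] = \frac{w^4}{4} + w^2 \sigma^2. $$

   Thus $f(x, w)$ has a relatively finite variance by using eq.(3.5) and
   $|w| \le A$.


4. The average and empirical log likelihood ratio functions are respectively
   given by

   $$ K(w) = \int f(x, w)\, q(x)\, dx = \frac{w^2}{2}, $$
   $$ K_n(w) = \frac{1}{n} \sum_{i=1}^{n} f(X_i, w) = \frac{w^2}{2} - \frac{w\, \xi_n}{\sqrt{n}}. $$

   The normalized partition function or the normalized marginal likelihood
   is equal to

   $$ Z_n^{(0)} = \frac{1}{2A} \int_{-A}^{A} \exp\!\left( -\frac{n w^2}{2} + \sqrt{n}\, w\, \xi_n \right) dw $$
   $$ = \frac{1}{2A} \int_{-\infty}^{\infty} \exp\!\left( -\frac{n w^2}{2} + \sqrt{n}\, w\, \xi_n \right) dw + O_p(e^{-n}) $$
   $$ = \frac{1}{2A} \sqrt{\frac{2\pi}{n}} \exp(\xi_n^2/2) + O_p(e^{-n}). $$

   Therefore the theoretical behavior of the free energy or the minus log
   marginal likelihood is derived as

   $$ F_n = n L_n(w_0) - \log Z_n^{(0)} $$
   $$ = \frac{n - 1}{2} \log(2\pi) + \frac{n \sigma_n^2 - \xi_n^2}{2}
      + \frac{1}{2} \log n + \log(2A) + o_p(1). $$

   Since $E[\xi_n^2] = \sigma^2$, its average is given by

   $$ E[F_n] = \frac{n - 1}{2} \log(2\pi) + \frac{1}{2} (n\sigma^2 + \log n - \sigma^2) + \log(2A) + o(1). $$


5. The calculation of $Z_n^{(0)}$ shows that the posterior distribution is
   asymptotically given by

   $$ p(w|X^n) = \sqrt{\frac{n}{2\pi}} \exp\!\left( -\frac{n}{2} \left( w - \frac{\xi_n}{\sqrt{n}} \right)^2 \right) + O_p(e^{-n}). $$

   Let $E_w^*[\ ]$ and $V_w^*[\ ]$ be the average and variance operators using
   the normal distribution whose average and variance are $\xi_n/\sqrt{n}$ and
   $1/n$, respectively. Then

   $$ E_w^*[w] = \frac{\xi_n}{\sqrt{n}}, $$
   $$ E_w^*[w^2] = \frac{1 + \xi_n^2}{n}, $$
   $$ E_w^*[w^k] = O_p(1/n^{3/2}) \quad (k \ge 3), $$

   so that

   $$ E_w[f(x, w)] = E_w^*[f(x, w)] + O_p(e^{-n}) = \frac{1 + \xi_n^2}{2n} - \frac{\xi_n x}{\sqrt{n}}, $$
   $$ V_w[f(x, w)] = V_w^*[f(x, w)] + O_p(e^{-n}) = \frac{x^2}{n} + O_p(1/n^{3/2}). $$

   Hence

   $$ E_w[K(w)] = E_X E_w[f(X, w)] = \frac{1 + \xi_n^2}{2n} + o_p(1/n), $$
   $$ E_w[K_n(w)] = \frac{1}{n} \sum_{i=1}^{n} E_w[f(X_i, w)] = \frac{1 - \xi_n^2}{2n} + o_p(1/n), $$
   $$ E_X V_w[f(X, w)] = \frac{\sigma^2}{n} + o_p(1/n). $$


6. By using the results above, and $\sigma_n^2 = \sigma^2 + O_p(1/\sqrt{n})$,
   we obtain the Bayesian statistical theory,

   $$ G_n = \frac{1}{2} \log(2\pi) + \frac{\sigma^2}{2} + \frac{1 + \xi_n^2 - \sigma^2}{2n} + o_p\!\left( \frac{1}{n} \right), $$
   $$ C_n = \frac{1}{2} \log(2\pi) + \frac{\sigma_n^2}{2} + \frac{1 - \xi_n^2 + \sigma^2}{2n} + o_p\!\left( \frac{1}{n} \right), $$
   $$ T_n = \frac{1}{2} \log(2\pi) + \frac{\sigma_n^2}{2} + \frac{1 - \xi_n^2 - \sigma^2}{2n} + o_p\!\left( \frac{1}{n} \right). $$

   Note that

   $$ (G_n - L(w_0)) + (C_n - L_n(w_0)) = \frac{1}{n} + o_p\!\left( \frac{1}{n} \right) $$

   holds. Since $E[\xi_n^2] = \sigma^2$, their averages are

   $$ E[G_n] = \frac{1}{2} \log(2\pi) + \frac{\sigma^2}{2} + \frac{1}{2n} + o\!\left( \frac{1}{n} \right), $$
   $$ E[C_n] = \frac{1}{2} \log(2\pi) + \frac{\sigma^2}{2} + \frac{1}{2n} + o\!\left( \frac{1}{n} \right), $$
   $$ E[T_n] = \frac{1}{2} \log(2\pi) + \frac{\sigma^2}{2} + \frac{1 - 2\sigma^2}{2n} + o\!\left( \frac{1}{n} \right). $$

   The random variable WAIC and its expected value, $W_n$ and $E[W_n]$,
   have the same asymptotic expansions as $C_n$ and $E[C_n]$, respectively.


   The average generalization loss is a decreasing function of $n$, whereas
   the average training loss is increasing. In the following chapters, we
   show that if a log density ratio function has a relatively finite variance,
   then the generalization loss, the cross validation loss, the training loss,
   and WAIC have the same asymptotic behaviors as in this case.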
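The expansion of $E[G_n]$ in this example can be compared with a simulation. The sketch below assumes that $A$ is large enough for the truncation $|w| \le A$ to be ignored, so that the posterior is treated as $N(\bar{x}, 1/n)$ and the predictive density as $N(\bar{x}, 1 + 1/n)$; the function name and the simulation sizes are choices made here.

```python
import numpy as np

def generalization_loss(sample_mean, n, sigma):
    # G_n = -E_X[log p(X|X^n)] for X ~ N(0, sigma^2) and an assumed
    # predictive density N(sample_mean, 1 + 1/n).
    s2 = 1.0 + 1.0 / n
    return 0.5 * np.log(2 * np.pi * s2) + (sigma ** 2 + sample_mean ** 2) / (2 * s2)

rng = np.random.default_rng(1)
n, sigma, trials = 50, 1.5, 20000
means = rng.normal(0.0, sigma, size=(trials, n)).mean(axis=1)
simulated = float(np.mean(generalization_loss(means, n, sigma)))
theory = 0.5 * np.log(2 * np.pi) + sigma ** 2 / 2 + 1.0 / (2 * n)
print(simulated, theory)
```

The simulated average and the value $\frac{1}{2}\log(2\pi) + \sigma^2/2 + 1/(2n)$ differ only at the order $o(1/n)$ plus Monte Carlo error.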


Example 19. If a log density ratio function does not have a relatively finite
variance, then the asymptotic behaviors are different in general. Let us study
the case given in Example 15. In this case, we cannot employ the above recipe
for theory construction. Let $(X_i, Y_i)$ $(i = 1, 2, \ldots, n)$ be pairs of
random variables which are subject to $q(x, y)$ in Example 15. We use the
notations


$$ \xi_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i, \qquad
   \eta_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i. $$


The empirical entropy is


$$ S_n = -\frac{1}{n} \sum_{i=1}^{n} \log q(X_i, Y_i) = \log(2\pi) + \frac{r_n^2}{2}, $$

where $r_n^2 = (1/n) \sum_{i=1}^{n} (X_i^2 + Y_i^2)$. The partition function is


$$ Z_n = \frac{\exp(-n(r_n^2 + 1)/2)}{(2\pi)^{n+1}}
   \int_{-\pi}^{\pi} \exp(\sqrt{n}\, \xi_n \cos\theta + \sqrt{n}\, \eta_n \sin\theta)\, d\theta $$
$$ = \frac{\exp(-n(r_n^2 + 1)/2 + \sqrt{n}\, \gamma_n)}{(2\pi)^{n+1}}
   \int_{-\pi}^{\pi} \exp(\sqrt{n}\, \gamma_n (\cos\theta - 1))\, d\theta, $$


where we used $\gamma_n = \sqrt{\xi_n^2 + \eta_n^2}$ and the cyclic condition
of $\theta$. If $n$ is sufficiently large, then the main part of the
integration over $[-\pi, \pi]$ is the neighborhood of the origin. By using


$$ \cos\theta - 1 = -\theta^2/2 + O(\theta^4) $$

and

$$ \int \exp(-\sqrt{n}\, \gamma_n \theta^2/2)\, d\theta
   = \sqrt{\frac{2\pi}{\sqrt{n}\, \gamma_n}} + o_p(\exp(-\sqrt{n})), $$


the free energy is


$$ F_n = n(r_n^2 + 1)/2 - \sqrt{n}\, \gamma_n + (n + 1/2) \log(2\pi)
   + \frac{1}{4} \log n + \frac{1}{2} \log \gamma_n + O_p(1). $$


Since both $\xi_n$ and $\eta_n$ are subject to the normal distribution with
average 0 and variance 1, $\gamma_n^2$ is subject to the chi-squared
distribution with 2 degrees of freedom. Therefore,


$$ E[\gamma_n] = \frac{1}{2} \int_0^\infty x^{1/2} e^{-x/2}\, dx = \sqrt{2}\, \Gamma(3/2), $$
$$ E[r_n^2] = \frac{1}{2} \int_0^\infty x\, e^{-x/2}\, dx = 2. $$


The average free energy is given by


$$ E[F_n] = (3/2 + \log(2\pi))\, n - \sqrt{2}\, \Gamma(3/2) \sqrt{n} + \frac{1}{4} \log n + O(1). $$


By Remark 14 and $E[G_n] = E[F_{n+1}] - E[F_n]$, if $E[G_n]$ has an asymptotic
expansion, then it is equal to


$$ E[G_n] = (3/2 + \log(2\pi)) - \frac{\Gamma(3/2)}{\sqrt{2n}} + \frac{1}{4n} + o(1/n). $$

This equation shows that the average generalization loss is an increasing
function of $n$. In this model, the distance from the true distribution to
each model is constant, hence, if $n$ is small, the posterior distribution is
spread over all parameters. However, if $n$ is large, then it concentrates on
some parameter region by the random fluctuation. Such a phenomenon is called
spontaneous symmetry breaking.
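Two ingredients of this example lend themselves to a quick numerical check: the Laplace-type approximation of the angular integral and the value of $E[\gamma_n]$. In the sketch below the symbol `a` stands for $\sqrt{n}\,\gamma_n$, and the value 400 is an arbitrary large number chosen for illustration.

```python
import math
import numpy as np

def angular_integral(a, m=200001):
    # Trapezoidal approximation of the integral of exp(a (cos(theta) - 1))
    # over [-pi, pi].
    theta = np.linspace(-np.pi, np.pi, m)
    y = np.exp(a * (np.cos(theta) - 1.0))
    return float(np.sum((y[1:] + y[:-1]) * np.diff(theta)) / 2.0)

a = 400.0
numeric = angular_integral(a)
laplace = math.sqrt(2.0 * math.pi / a)        # Gaussian approximation near theta = 0
mean_gamma = math.sqrt(2.0) * math.gamma(1.5)  # sqrt(2) * Gamma(3/2)
print(numeric / laplace, mean_gamma, math.sqrt(math.pi / 2.0))
```

The ratio approaches 1 as `a` grows, and $\sqrt{2}\,\Gamma(3/2)$ coincides with $\sqrt{\pi/2}$, the mean of a chi distribution with 2 degrees of freedom.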


Remark 20. In this book, we mainly study the case when the log density
ratio function has a relatively finite variance and show that the free energy
and several losses are subject to the universal law. It may seem that Example
19 is a special or pathological exception, but complicated and hierarchical
statistical models used in deep learning may reveal the same phenomenon.
For example, in some regression problem, if a true distribution of $Z$,


$$ Z = \exp(-X^2 - Y^2) + N(0, 1), $$


is statistically estimated by a neural network


$$ Z = \sum_{h=1}^{H} a_h\, \sigma(b_h X + c_h Y + d_h) + N(0, 1), $$


where $\sigma$ is a sigmoidal function and $\{a_h, b_h, c_h, d_h\}$ is a
parameter, then the true distribution is unrealizable by and singular for the
statistical model. In this case, the same phenomenon shown by Example 19
occurs. Spontaneous symmetry breaking will be an important theme in Bayesian
statistics in the future.


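3.5 Problems


1. Let $x \in \mathbb{R}$ and $\{e_k(x) \,;\, k = 1, 2, \ldots\}$ be a set of
functions which satisfy

$$ \int e_k(x)\, e_\ell(x)\, q(x)\, dx = \delta_{k\ell}, $$

where if $k = \ell$ then $\delta_{k\ell} = 1$, and if $k \neq \ell$ then
$\delta_{k\ell} = 0$. Let $X$ be a random variable which is subject to a
probability density $q(x)$ and $\{a_k \neq 0\}$ be a set of nonzero real
values which satisfy

$$ \sum_{k=1}^{\infty} (a_k)^2 < \infty. $$

Assume that a conditional density $q(y|x)$ and a statistical model
$p(y|x, w)$ are respectively defined by

$$ Y = \sum_{k=1}^{\infty} a_k e_k(X) + N(0, 1^2), $$
$$ Y = \sum_{k=1}^{K} w_k e_k(X) + N(0, 1^2). $$

Then prove that $q(x) q(y|x)$ is unrealizable by and regular for
$q(x) p(y|x, w)$.


2. Let $\beta > 0$ be a positive constant. For a given statistical model
$p(x|w)$, a prior $\varphi(w)$, and a set of random variables $X^n$, a
generalized posterior distribution of the inverse temperature $\beta$ is
defined by

$$ p^{(\beta)}(w|X^n) = \frac{1}{Z_n(\beta)}\, \varphi(w) \prod_{i=1}^{n} p(X_i|w)^\beta, $$

where

$$ Z_n(\beta) = \int \varphi(w) \prod_{i=1}^{n} p(X_i|w)^\beta\, dw. $$

Let $E_w^\beta[\ ]$ be the averaged value over $p^{(\beta)}(w|X^n)$. Then the
generalized predictive density is defined by

$$ p^{(\beta)}(x|X^n) = E_w^\beta[p(x|w)]. $$

Therefore, $\beta = 1$ results in the ordinary Bayesian estimation. The
generalization loss, the training loss, the cross validation loss, and WAIC
are respectively generalized by

$$ G_n^\beta = -E_X[\log p^{(\beta)}(X|X^n)], $$
$$ T_n^\beta = -\frac{1}{n} \sum_{i=1}^{n} \log p^{(\beta)}(X_i|X^n), $$
$$ C_n^\beta = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{E_w^\beta[p(X_i|w)^{1-\beta}]}{E_w^\beta[p(X_i|w)^{-\beta}]}, $$
$$ W_n^\beta = T_n^\beta + \frac{\beta}{n} \sum_{i=1}^{n} V_w^\beta[\log p(X_i|w)], $$

where $V_w^\beta[\ ]$ is the variance over $p^{(\beta)}(w|X^n)$. The cumulant
generating functions are also generalized by

$$ \mathcal{G}_n^\beta(\alpha) = E_X[\log E_w^\beta[p(X|w)^\alpha]], \tag{3.30} $$
$$ \mathcal{T}_n^\beta(\alpha) = \frac{1}{n} \sum_{i=1}^{n} \log E_w^\beta[p(X_i|w)^\alpha]. \tag{3.31} $$

Assume that eqs.(3.24) and (3.25) hold for $\mathcal{G}_n^\beta(\alpha)$ and
$\mathcal{T}_n^\beta(\alpha)$ instead of $\mathcal{G}_n(\alpha)$ and
$\mathcal{T}_n(\alpha)$. Then prove the following equations.

$$ G_n^\beta = -(\mathcal{G}_n^\beta)'(0) - \frac{1}{2} (\mathcal{G}_n^\beta)''(0) + o_p\!\left( \frac{1}{n} \right), \tag{3.32} $$
$$ T_n^\beta = -(\mathcal{T}_n^\beta)'(0) - \frac{1}{2} (\mathcal{T}_n^\beta)''(0) + o_p\!\left( \frac{1}{n} \right), \tag{3.33} $$
$$ C_n^\beta = -(\mathcal{T}_n^\beta)'(0) + \frac{2\beta - 1}{2} (\mathcal{T}_n^\beta)''(0) + o_p\!\left( \frac{1}{n} \right), \tag{3.34} $$
$$ W_n^\beta = -(\mathcal{T}_n^\beta)'(0) + \frac{2\beta - 1}{2} (\mathcal{T}_n^\beta)''(0) + o_p\!\left( \frac{1}{n} \right). \tag{3.35} $$

These equations show that the universal law of Bayesian statistics holds
even if the posterior distribution is given by the inverse temperature
$\beta > 0$.


3. Let us study the multinomial distribution and its prior defined by
eq.(2.13) and eq.(2.14). Then by using eq.(2.18), it follows that

$$ \log E_w[p(x|w)^\alpha] = \log \int p(x|w)^\alpha\, p(w|X^n)\, dw $$
$$ = \sum_{j=1}^{N} \log \Gamma(\alpha x_j + n_j + a_j)
   - \log \Gamma\Big( n + \alpha + \sum_{j=1}^{N} a_j \Big) + c_1, $$

where $c_1$ is a constant function of $\alpha$. Assume that a true
distribution is given by $p(x|w_0)$. Prove that the $k$th $(k \ge 1)$
cumulants are given by

$$ \mathcal{G}_n^{(k)}(0) = \sum_{j=1}^{N} w_{0j}\, \psi^{(k-1)}(n_j + a_j)
   - \psi^{(k-1)}\Big( n + \sum_{j=1}^{N} a_j \Big), $$
$$ \mathcal{T}_n^{(k)}(0) = \sum_{j=1}^{N} (n_j/n)\, \psi^{(k-1)}(n_j + a_j)
   - \psi^{(k-1)}\Big( n + \sum_{j=1}^{N} a_j \Big), $$

where $\psi^{(k-1)}(x)$ is the $(k-1)$th derivative of
$\psi(x) = (\log \Gamma(x))'$. Then by using the asymptotic expansion

$$ \psi^{(k-1)}(x) = O(1/x^{k-1}) \quad (k \ge 2), $$

prove that if $k \ge 2$,

$$ \mathcal{G}_n^{(k)}(0) = O_p(1/n^{k-1}), $$
$$ \mathcal{T}_n^{(k)}(0) = O_p(1/n^{k-1}). $$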
Chapter 4


Regular Posterior
Distribution


In this chapter, we study a special case when a true distribution $q(x)$ is
regular for a statistical model $p(x|w)$ and the sample size $n$ is large
enough to ensure that

    Posterior Distribution ≈ Normal Distribution

holds in the neighborhood of the optimal parameter $w_0$. In such a case, the
asymptotic behaviors of the free energy or the minus log marginal likelihood
are derived, and the generalization loss, the training loss, the cross
validation loss, and WAIC are clarified.

(1) At first, we explain that the posterior distribution is divided into the
essential and nonessential parts.

(2) Asymptotic expansion of the free energy is shown.

(3) Asymptotic expansions of the generalization loss, the training loss, the
cross validation loss, and WAIC are proved.

(4) The mathematical proof is given for a basic Bayesian treatment.

(5) Point estimators such as the maximum likelihood or maximum a posteriori
estimators are introduced.

A statistician who knows the conventional asymptotic theory can skip this
chapter.


4.1 Division of Partition Function 


In Bayesian theory, the parameter set is divided into the essential part and 
the nonessential part. The essential one is the set of the neighborhood of 
the optimal parameter, whereas the nonessential one is its complement. 
Let $(q(x), p(x|w), \varphi(w))$ be a triple of a true probability density, a
statistical model, and a prior. The average log loss function is defined by


$$ L(w) = -\int q(x) \log p(x|w)\, dx. $$


We use the notation $\nabla = (\partial/\partial w)$; then $\nabla L(w_0)$ and
$\nabla^2 L(w_0)$ are a $d$-dimensional vector and a $d \times d$ matrix,
respectively. The matrix $\nabla^2 L(w_0)$ is sometimes referred to as the
Hessian matrix of $L(w)$ at $w_0$. In this section, it is assumed that the set
of all parameters $W$ is a compact subset of $\mathbb{R}^d$ and that there
exist both a unique parameter $w_0$ and an open subset $U$ such that
$w_0 \in U \subset W \subset \mathbb{R}^d$ and that $w_0$ minimizes $L(w)$. It
is not assumed that the true density is realizable by the statistical model,
in general. The probability density function of the optimal parameter is
denoted by


$$ p_0(x) = p(x|w_0). $$


Therefore $q(x) = p(x|w_0)$ does not hold in general. Also it is assumed that
$\nabla \varphi(w)$ is a continuous function and that $\varphi(w_0) > 0$. The
log density ratio function is


$$ f(x, w) = \log \frac{p_0(x)}{p(x|w)} = \log \frac{p(x|w_0)}{p(x|w)}. $$


Hence $f(x, w_0) = 0$. By using the average log density ratio function
$K(w) = E_X[f(X, w)]$,


$$ L(w) = K(w) + L(w_0). $$


Therefore $K(w) \ge 0$ and $K(w)$ takes the minimum value zero if and only if
$w = w_0$. The log likelihood ratio function is defined by


$$ K_n(w) = \frac{1}{n} \sum_{i=1}^{n} f(X_i, w), $$


which satisfies $K_n(w_0) = 0$. For a simple proof, we assume that $f(x, w)$
has a relatively finite variance and is a $C^\ell$ class function for
sufficiently large $\ell$, that is to say,


$$ \nabla^\ell f(x, w) $$


is a continuous function of $w$. Further, it is assumed that, for a
sufficiently large $k$,


$$ E_X\Big[ \sup_{w \in W} \|\nabla^\ell f(X, w)\|^k \Big] < \infty. \tag{4.1} $$

Then $K(w)$ is also a $C^\ell$ class function. The main assumption of
regularity is that

$$ J = \nabla^2 K(w_0) $$


is a positive definite matrix. By the assumption that $f(x, w)$ has a
relatively finite variance, there exists $c_0 > 0$ such that


$$ E_X[f(X, w)^2] \le c_0 K(w), $$


so that the function


$$ a(x, w) = \frac{K(w) - f(x, w)}{\sqrt{K(w)}} $$

is well defined for $w \neq w_0$ and

$$ \sup_{w \neq w_0} E_X[a(X, w)^2] < \infty. \tag{4.2} $$

Remark 21. In the neighborhood of $w_0$,


$$ K(w) = \frac{1}{2} (w - w_0) \cdot J (w - w_0) + o(\|w - w_0\|^2). $$


By the condition that $f(x, w)$ has a relatively finite variance, there exists
$c_1 > 0$ such that


$$ E_X[f(X, w)^2] \le c_1 \|w - w_0\|^2. $$


Hence eq.(4.2) holds. The function $a(x, w)$ is bounded but may be
discontinuous at $w = w_0$. However, it can be made well-defined as a function
of the generalized polar coordinate $(w - w_0) = r\theta$, where
$r = \|w - w_0\|$ and $\theta = (w - w_0)/r$.


Example 20. In order to illustrate the functions defined above, we study a simple case,

$$ p(x, y|u, v) = \frac{1}{2\pi} \exp\Big(-\frac{1}{2}\big((x-u)^2 + (y-v)^2\big)\Big) $$

and $q(x,y) = p(x, y|0,0)$. Then $w_0 = (0,0)$. The log density ratio function
and its average function are respectively given by

$$ f(x,y,u,v) = \frac{1}{2}(u^2 + v^2 - 2ux - 2vy), $$

$$ K(u,v) = \frac{1}{2}(u^2 + v^2). $$

The log likelihood ratio function is

$$ K_n(u,v) = \frac{1}{2}(u^2 + v^2) - u\Big(\frac{1}{n}\sum_{i=1}^{n} X_i\Big) - v\Big(\frac{1}{n}\sum_{i=1}^{n} Y_i\Big). $$

The function $a(x, y, u, v)$ is

$$ a(x,y,u,v) = \frac{\sqrt{2}\,(ux + vy)}{\sqrt{u^2 + v^2}}. $$

For a given $(x,y) \ne (0,0)$, $a(x, y, u, v)$ is a bounded but discontinuous function
of $(u,v)$ in the neighborhood of $(u,v) = (0,0)$. However, by using

$$ u = r\cos\theta, \quad v = r\sin\theta, $$

it follows that

$$ a(x,y,r,\theta) = \sqrt{2}\,(x\cos\theta + y\sin\theta) $$

is a well-defined function. Note that $(u,v) = (0,0)$ corresponds to $r = 0$
with $\theta$ free. If a true density is regular for a statistical model, then the
same transform can always be employed. In the following chapters, we show that,
even if a true density is not regular for a statistical model, the generalized
procedure called resolution of singularities, provided by algebraic geometry, can
be applied.
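The polar-coordinate statement in Example 20 can be checked numerically. The following is a minimal sketch, assuming the definitions $f$, $K$ and $a = (K - f)/\sqrt{K}$ of this section; the function names are illustrative and not part of the original text.

```python
import math

def f(x, y, u, v):
    # Log density ratio function of Example 20.
    return 0.5 * (u * u + v * v - 2.0 * u * x - 2.0 * v * y)

def K(u, v):
    # Average log density ratio function of Example 20.
    return 0.5 * (u * u + v * v)

def a(x, y, u, v):
    # a = (K - f) / sqrt(K); defined only for (u, v) != (0, 0).
    return (K(u, v) - f(x, y, u, v)) / math.sqrt(K(u, v))

def a_polar(x, y, theta):
    # Closed form in polar coordinates; it does not depend on r.
    return math.sqrt(2.0) * (x * math.cos(theta) + y * math.sin(theta))

if __name__ == "__main__":
    x, y = 0.7, -1.3
    for theta in (0.0, 1.0, 2.5):
        for r in (1.0, 1e-2, 1e-4):
            u, v = r * math.cos(theta), r * math.sin(theta)
            print(theta, r, a(x, y, u, v), a_polar(x, y, theta))
```

Along each ray the value stays constant as $r$ shrinks, while different angles give different limits, which is the discontinuity at $(u,v) = (0,0)$ described above.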


We define a function

$$ \eta_n(w) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} a(X_i, w), $$

and assume that it satisfies the asymptotic expectation condition with index
$k$. For the asymptotic expectation condition, see Section 10.5. That is to
say, we assume that

$$ \eta_n = \sup_{w \ne w_0} |\eta_n(w)| $$

satisfies $E[(\eta_n)^{k+\epsilon_0}] < \infty$ for some $\epsilon_0 > 0$. Firstly, we analyze the normalized
partition function defined by

$$ Z_n^{(0)} = \int_W \exp(-nK_n(w))\,\varphi(w)\,dw. $$

For a given positive value $\epsilon > 0$, we define

$$ W_1 = \{w \in W; \|w - w_0\| < \epsilon\}, $$
$$ W_2 = \{w \in W; \|w - w_0\| \ge \epsilon\}. $$

Then by defining

$$ Z_n^{(1)} = \int_{W_1} \exp(-nK_n(w))\,\varphi(w)\,dw, $$

$$ Z_n^{(2)} = \int_{W_2} \exp(-nK_n(w))\,\varphi(w)\,dw, $$

the normalized partition function is equal to

$$ Z_n^{(0)} = Z_n^{(1)} + Z_n^{(2)}. \tag{4.3} $$

Here $Z_n^{(1)}$ and $Z_n^{(2)}$ are the integrations over parameters in a neighborhood of
the optimal parameter $w_0$ and in its complement, respectively.

In regular theory, the posterior distribution concentrates
on a neighborhood of the optimal parameter when $n \to \infty$. In this chapter,
we define a positive real value $\epsilon$ as a function of $n$,

$$ \epsilon = \frac{1}{n^{2/5}}. $$

Then

$$ \lim_{n \to \infty} \epsilon = 0, $$

$$ \lim_{n \to \infty} \sqrt{n}\,\epsilon = \infty. $$

In the following, we show that, when $n \to \infty$,

$$ Z_n^{(1)} \gg Z_n^{(2)}, $$

hence $Z_n^{(1)}$ and $Z_n^{(2)}$ are called the essential and nonessential parts of the
normalized partition function respectively.
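The two limits for the radius can be made concrete with a few values of $n$. This is a small illustrative sketch, assuming the choice $\epsilon = n^{-2/5}$ used in Lemma 9 and Lemma 11; the function name `epsilon` is hypothetical.

```python
def epsilon(n):
    # Radius of the essential region, assumed to be n^(-2/5).
    return n ** (-2.0 / 5.0)

if __name__ == "__main__":
    for n in (10, 10**3, 10**6, 10**9):
        eps = epsilon(n)
        # eps shrinks to zero, while sqrt(n) * eps = n^(1/10) grows without bound.
        print(n, eps, (n ** 0.5) * eps)
```

The radius shrinks more slowly than the $1/\sqrt{n}$ scale of the posterior, so the essential region eventually contains the bulk of the posterior mass.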
Firstly we study the nonessential part $Z_n^{(2)}$.


Lemma 9. Let $J_1 > 0$ be the minimum eigenvalue of the matrix $J = \nabla^2 K(w_0)$
and $K_1$ be the maximum value of $K(w)$ in $W$. If $n$ is sufficiently
large,

$$ Z_n^{(2)} \le \exp\big(-(J_1/4)\,n^{1/5} + \eta_n^2/2\big), $$
$$ Z_n^{(2)} \ge \exp\big(-2K_1 n - \eta_n^2/2\big). $$

Hence for arbitrary $k > 0$, the convergence in probability $n^k Z_n^{(2)} \to 0$ holds.
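Proof. By the definition of $\eta_n(w)$,

$$ nK_n(w) = nK(w) - \sqrt{nK(w)}\,\eta_n(w). $$

By using

$$ \sqrt{nK(w)}\,|\eta_n(w)| \le \frac{1}{2}\big(nK(w) + \eta_n^2\big), $$

it follows that

$$ nK_n(w) \ge nK(w)/2 - \eta_n^2/2, $$
$$ nK_n(w) \le 3nK(w)/2 + \eta_n^2/2. $$

In the neighborhood of $w_0$,

$$ nK(w) = \frac{n}{2}\|J^{1/2}(w - w_0)\|^2 + O(n\|w - w_0\|^3). $$

Hence, in $\|w - w_0\| \ge n^{-2/5}$, for sufficiently large $n$,

$$ (J_1/4)\,n^{1/5} \le nK(w) \le nK_1, $$

and

$$ \exp(-nK_1/2) \le \int_{W_2} \varphi(w)\,dw. $$

Then by the definition,

$$ Z_n^{(2)} = \int_{\|w - w_0\| \ge n^{-2/5}} \exp(-nK_n(w))\,\varphi(w)\,dw, $$

Lemma 9 is obtained. □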


Secondly, we study the essential part of the normalized partition function,
which is the integration over the set $\|w - w_0\| < n^{-2/5}$. An empirical process
$\eta_n(w)$ is defined by

$$ \eta_n(w) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \{K(w) - f(X_i, w)\}. \tag{4.4} $$

Then

$$ nK_n(w) = nK(w) - \sqrt{n}\,\eta_n(w). $$
We assume that $\nabla\eta_n(w)$ and $\nabla^2\eta_n(w)$ satisfy the asymptotic expectation
condition with sufficiently large index $k$. Two random variables are defined
by

$$ \eta_n^{(2)} = \frac{1}{2} \sup_{w \in W_1} \|\nabla^2 \eta_n(w)\|, $$

$$ \xi_n = J^{-1/2}\,\nabla\eta_n(w_0). $$

Then $E[\eta_n^{(2)}] < \infty$ and $\xi_n \to \xi$ in distribution hold, where $\xi$ is a random
variable that is subject to a normal distribution. Also we define a function
$\delta_n(w)$ by

$$ \delta_n(w) = n\Big\{K(w) - \frac{1}{2}\|J^{1/2}(w - w_0)\|^2\Big\} - \sqrt{n}\,\{\eta_n(w) - \nabla\eta_n(w_0) \cdot (w - w_0)\}. \tag{4.5} $$

By the definition, it follows that

$$ nK_n(w) = \frac{n}{2}\|J^{1/2}(w - w_0)\|^2 - \sqrt{n}\,\nabla\eta_n(w_0) \cdot (w - w_0) + \delta_n(w). $$

By the assumption that $\nabla^3 K(w)$ is a continuous function,

$$ K^{(3)} = \frac{1}{6} \sup_{w \in W_1} \|\nabla^3 K(w)\| < \infty. $$


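Lemma 10. The following inequalities hold.

$$ \sup_{w \in W_1} |\delta_n(w)| \le n^{-1/5} K^{(3)} + n^{-3/10}\,\eta_n^{(2)}, \tag{4.6} $$
$$ \sup_{w \in W_1} \|\nabla\delta_n(w)\| \le 3n^{1/5} K^{(3)} + 2n^{1/10}\,\eta_n^{(2)}. \tag{4.7} $$

Proof. By applying the mean value theorem to $K(w)$ and $\eta_n(w)$, there exist
$w^*, w^{**} \in W_1$ such that

$$ K(w) = \frac{1}{2}\|J^{1/2}(w - w_0)\|^2 + \frac{1}{6}\nabla^3 K(w^*)(w - w_0)^3, $$
$$ \eta_n(w) = \nabla\eta_n(w_0) \cdot (w - w_0) + \frac{1}{2}\nabla^2\eta_n(w^{**})(w - w_0)^2, $$

hence eq.(4.6) is derived. Also by applying the mean value theorem to
$\nabla K(w)$ and $\nabla\eta_n(w)$, there exist $w', w'' \in W_1$ such that

$$ \nabla K(w) = J(w - w_0) + \frac{1}{2}\nabla^3 K(w')(w - w_0)^2, $$
$$ \nabla\eta_n(w) = \nabla\eta_n(w_0) + \nabla^2\eta_n(w'')(w - w_0), $$

hence eq.(4.7) is derived. □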
We assume that a prior $\varphi(w)$ is a function of class $C^1$ and that $\varphi(w_0) > 0$.
Then

$$ \varphi^{(1)} = \sup_{w \in W_1} \|\nabla\varphi(w)\| < \infty. $$


The following lemma shows the asymptotic behavior of the essential part. 


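Lemma 11. The essential part of the normalized partition function satisfies
the inequalities

$$ Z_n^{(1)} \le \frac{(2\pi)^{d/2}}{n^{d/2}(\det J)^{1/2}} \Big(\varphi(w_0) + \frac{\varphi^{(1)}}{n^{2/5}}\Big) \exp\Big(\frac{1}{2}\|\xi_n\|^2 + \frac{K^{(3)}}{n^{1/5}} + \frac{\eta_n^{(2)}}{n^{3/10}}\Big), $$

$$ Z_n^{(1)} \ge \frac{(2\pi)^{d/2}\,\Phi_n(\xi_n)}{n^{d/2}(\det J)^{1/2}} \Big(\varphi(w_0) - \frac{\varphi^{(1)}}{n^{2/5}}\Big) \exp\Big(\frac{1}{2}\|\xi_n\|^2 - \frac{K^{(3)}}{n^{1/5}} - \frac{\eta_n^{(2)}}{n^{3/10}}\Big), $$

where

$$ \Phi_n(\xi_n) = \frac{1}{(2\pi)^{d/2}} \int_{\|J^{-1/2}w\| < n^{1/10}} \exp\big(-\|w - \xi_n\|^2/2\big)\,dw, \tag{4.8} $$

which satisfies

$$ \Phi_n(\xi_n) \ge \frac{\exp(-\|\xi_n\|^2)}{2(2\pi)^{d/2}} $$

for sufficiently large $n$. Therefore the convergence in probability holds,

$$ n^{d/2} Z_n^{(1)} \exp(-\|\xi_n\|^2/2) \to \frac{(2\pi)^{d/2}}{(\det J)^{1/2}}\,\varphi(w_0). $$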


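Proof. In the region $W_1 = \{w \in W; \|w - w_0\| < n^{-2/5}\}$, by the mean value
theorem,

$$ \varphi(w) \le \varphi(w_0) + n^{-2/5}\varphi^{(1)}, $$
$$ \varphi(w) \ge \varphi(w_0) - n^{-2/5}\varphi^{(1)}. $$

Also in the region $W_1$,

$$ nK_n(w) \le \frac{1}{2}\|(nJ)^{1/2}(w - w_0) - \xi_n\|^2 - \frac{1}{2}\|\xi_n\|^2 + \frac{K^{(3)}}{n^{1/5}} + \frac{\eta_n^{(2)}}{n^{3/10}}, $$

$$ nK_n(w) \ge \frac{1}{2}\|(nJ)^{1/2}(w - w_0) - \xi_n\|^2 - \frac{1}{2}\|\xi_n\|^2 - \frac{K^{(3)}}{n^{1/5}} - \frac{\eta_n^{(2)}}{n^{3/10}}. $$

By putting $w' = (nJ)^{1/2}(w - w_0)$,

$$ \int_{W_1} \exp\Big(-\frac{1}{2}\|(nJ)^{1/2}(w - w_0) - \xi_n\|^2\Big)\,dw = \frac{(2\pi)^{d/2}\,\Phi_n(\xi_n)}{n^{d/2}(\det J)^{1/2}}. $$

Since $\Phi_n(\xi_n)$ is the probability of the set $\|J^{-1/2}w'\| < n^{1/10}$, $\Phi_n(\xi_n) \le 1$.
Moreover,

$$ \Phi_n(\xi_n) \ge \frac{\exp(-\|\xi_n\|^2)}{(2\pi)^{d/2}} \int_{\|J^{-1/2}w'\| < n^{1/10}} \exp(-\|w'\|^2)\,dw', $$

which completes the proof of Lemma 11. □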


In this section we proved bounds, as random variables, for the essential and
nonessential parts of the normalized partition function (Lemma 11 and Lemma 9,
respectively).


4.2 Asymptotic Free Energy 


In this section, we derive the asymptotic expansion of the free energy or the
minus log marginal likelihood $F_n$. A matrix $I$ is defined by

$$ I = E_X\big[\nabla f(X, w_0)(\nabla f(X, w_0))^T\big]. $$

Then by the definition,

$$ E\Big[\frac{1}{n}\sum_{i=1}^{n} \nabla f(X_i, w_0)(\nabla f(X_i, w_0))^T\Big] = I $$

and by the central limit theorem,

$$ \frac{1}{n}\sum_{i=1}^{n} \nabla f(X_i, w_0)(\nabla f(X_i, w_0))^T = I + O_p(1/n^{1/2}). $$

If $q(x)$ is realizable by $p(x|w)$, then $I$ is called the Fisher information matrix
at $w_0$ and $I = J$. However, in general, $I \ne J$. By the definition of $I$, it
is positive semidefinite. In the regularity condition of Bayesian theory, the
eigenvalues of $J$ are positive, but $I$ may have a zero eigenvalue.
Theorem 4. If the regularity condition holds, then the Bayesian free energy
$F_n$ satisfies

$$ F_n = nL_n(w_0) + \frac{d}{2}\log n - \log\varphi(w_0) - \frac{d}{2}\log(2\pi) + \frac{1}{2}\log\det J - \frac{1}{2}\|\xi_n\|^2 + o_p(1), \tag{4.9} $$

$$ E[F_n] = nL(w_0) + \frac{d}{2}\log n - \log\varphi(w_0) - \frac{d}{2}\log(2\pi) + \frac{1}{2}\log\det J - \frac{1}{2}\mathrm{tr}(IJ^{-1}) + o(1). \tag{4.10} $$


Proof. By the assumption, $\xi_n$ and $\nabla^2\eta_n(w)$ converge in distribution and
their expectation values also converge. By the definition,

$$ F_n = F_n^{(0)} + nL_n(w_0), $$

where $F_n^{(0)}$ is a normalized free energy, which is given by the essential and
nonessential parts,

$$ F_n^{(0)} = -\log Z_n^{(0)} = -\log\big(Z_n^{(1)} + Z_n^{(2)}\big). $$

When $n \to \infty$, the convergence in probability $\Phi_n(\xi_n) \to 1$ holds, where $\Phi_n(\xi_n)$
is defined by eq.(4.8). By Lemma 9 and 11, the convergences in probability

$$ n^{d/2} Z_n^{(1)} \exp\Big(-\frac{1}{2}\|\xi_n\|^2\Big) \to \frac{(2\pi)^{d/2}}{(\det J)^{1/2}}\,\varphi(w_0), $$

$$ n^{d/2} Z_n^{(2)} \to 0, $$

hold. It follows that eq.(4.9) holds. Let us prove eq.(4.10). Let $\xi$ be a
random variable which is subject to the normal distribution whose average
and covariance are equal to those of $\xi_n$. Then by the central limit theorem,
$\xi_n \to \xi$ in distribution holds. By the convergence in distribution,

$$ n^{d/2} Z_n^{(1)} \to \frac{(2\pi)^{d/2}}{(\det J)^{1/2}}\,\varphi(w_0) \exp\Big(\frac{1}{2}\|\xi\|^2\Big) $$

and the convergence in probability $n^{d/2} Z_n^{(2)} \to 0$, the convergence in distribution

$$ n^{d/2} Z_n^{(0)} \to \frac{(2\pi)^{d/2}}{(\det J)^{1/2}}\,\varphi(w_0) \exp\Big(\frac{1}{2}\|\xi\|^2\Big) $$
holds. Hence $F_n^{(0)} - (d/2)\log n$ also converges in distribution. In order to
prove the convergence of the expected value $E[F_n^{(0)} - (d/2)\log n]$, it is
sufficient to prove that the sequence of random variables $(F_n^{(0)} - (d/2)\log n)$ is
uniformly asymptotically integrable. To prove that it is uniformly asymptotically
integrable, it is sufficient to prove $E[|F_n^{(0)} - (d/2)\log n|^{1+\epsilon}] < \infty$ for some
$\epsilon > 0$. (See Section 10.5). We define two random variables,

$$ A_n = \log\big(n^{d/2} Z_n^{(1)}\big), $$
$$ B_n = \log\big(n^{d/2} Z_n^{(2)}\big). $$

Then

$$ (d/2)\log n - F_n^{(0)} = \log(e^{A_n} + e^{B_n}). $$

For arbitrary real numbers $x, y$,

$$ x \le \log(e^x + e^y) \le \max(x, y) + \log 2. $$

Hence

$$ |\log(e^x + e^y)| \le \max(|x|, |\max(x, y)|) + \log 2. $$

Therefore

$$ |\log(e^{A_n} + e^{B_n})| \le \max(|A_n|, |\max(A_n, B_n)|) + \log 2. $$

By Lemma 9 and 11, and

$$ \Phi_n(\xi_n) \ge \frac{\exp(-\|\xi_n\|^2)}{(2\pi)^{d/2}} \int_{\|w\| < 1} \exp(-\|w\|^2)\,dw, $$

there exist constants $c_1, c_2, c_3$ such that

$$ A_n \le c_1 + \|\xi_n\|^2/2 + \eta_n^{(2)}/n^{3/10}, $$
$$ A_n \ge c_2 - \|\xi_n\|^2/2 - \eta_n^{(2)}/n^{3/10}, $$
$$ \max(A_n, B_n) \le c_3 + \|\xi_n\|^2/2 + \eta_n^{(2)}/n^{3/10} + \eta_n^2/2. $$

Then by

$$ \max(A_n, B_n) \ge A_n, $$

there exists $c_4 > 0$ such that

$$ |(d/2)\log n - F_n^{(0)}| \le \|\xi_n\|^2/2 + \eta_n^{(2)}/n^{3/10} + c_4. $$
By the assumption, $\xi_n$ and $\eta_n^{(2)}$ satisfy the asymptotic expectation condition
with index $k$, so $E[|F_n^{(0)} - (d/2)\log n|^{1+\epsilon}] < \infty$ for some $\epsilon > 0$. Lastly, since

$$ E\big[\nabla\eta_n(w_0)(\nabla\eta_n(w_0))^T\big] = \int \nabla f(x, w_0)(\nabla f(x, w_0))^T q(x)\,dx, $$

it follows that

$$ E[\|\xi_n\|^2] = E\big[\mathrm{tr}\big(J^{-1}\nabla\eta_n(w_0)(\nabla\eta_n(w_0))^T\big)\big] = \mathrm{tr}\big(J^{-1}E[\nabla\eta_n(w_0)(\nabla\eta_n(w_0))^T]\big), $$

which completes the theorem. □


Remark 22. In this section, under the regularity condition, $J > 0$ and $\varphi(w_0) > 0$,
we proved that

$$ F_n = nL_n(w_0) + \frac{d}{2}\log n - \log\varphi(w_0) - \frac{d}{2}\log(2\pi) + \frac{1}{2}\log\det J - \frac{1}{2}\|\xi_n\|^2 + o_p(1) $$

and

$$ E[\|\xi_n\|^2] = \mathrm{tr}(IJ^{-1}) + o(1). $$

If $\det J = 0$, $\det J = \infty$, $\varphi(w_0) = 0$, or $\varphi(w_0) = \infty$, then this asymptotic
expansion does not hold because

$$ \log\det J, \quad \log\varphi(w_0) $$

are not finite. For the case when the regularity condition is not satisfied,
see the following sections.
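As an illustration of eq.(4.9), the expansion can be compared with an exact free energy in a case where the marginal likelihood is available in closed form. The sketch below assumes a one-dimensional model $p(x|w) = N(x; w, 1)$, a true density $N(0,1)$ (so $w_0 = 0$, $d = 1$, $J = 1$, $\xi_n = \sqrt{n}\,\bar{X}$) and a prior $N(0, \tau^2)$; this setting and all function names are assumptions made for the example, not taken from the text.

```python
import math
import random

def exact_free_energy(xs, tau2):
    # -log of the marginal likelihood for p(x|w) = N(x; w, 1), prior N(w; 0, tau2).
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x * x for x in xs)
    prec = n + 1.0 / tau2  # posterior precision of w
    log_marginal = (-0.5 * n * math.log(2 * math.pi) - 0.5 * s2
                    - 0.5 * math.log(2 * math.pi * tau2)
                    + 0.5 * math.log(2 * math.pi / prec)
                    + s1 * s1 / (2.0 * prec))
    return -log_marginal

def asymptotic_free_energy(xs, tau2):
    # Right-hand side of eq.(4.9) without o_p(1), for w_0 = 0, d = 1, J = 1.
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x * x for x in xs)
    n_Ln_w0 = 0.5 * n * math.log(2 * math.pi) + 0.5 * s2  # n * L_n(w_0)
    log_prior_w0 = -0.5 * math.log(2 * math.pi * tau2)    # log prior density at w_0
    xi_sq = s1 * s1 / n                                   # ||xi_n||^2
    return (n_Ln_w0 + 0.5 * math.log(n) - log_prior_w0
            - 0.5 * math.log(2 * math.pi) - 0.5 * xi_sq)  # log det J = 0

def sample(n, seed=0):
    # Draw n points from the assumed true density N(0, 1).
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

if __name__ == "__main__":
    for n in (10, 100, 10000):
        xs = sample(n)
        print(n, exact_free_energy(xs, 1.0) - asymptotic_free_energy(xs, 1.0))
```

In this conjugate setting the gap between the two quantities shrinks as $n$ grows, in line with the $o_p(1)$ remainder.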


Example 21. Let $x, w \in \mathbb{R}^M$ and $y \in \mathbb{R}$, and let a statistical model be

$$ p(y|x, w) = \frac{1}{(2\pi)^{1/2}} \exp\Big(-\frac{1}{2}(y - w \cdot x)^2\Big). \tag{4.11} $$

Assume that $X$ is subject to some density $q(x)$. The matrix $J$ of the statistical
model eq.(4.11) is

$$ J_{jk} = \int x_j x_k\, q(x)\,dx. $$

The support of $q(x)$ is defined by

$$ \mathrm{supp}\, q = \overline{\{x \in \mathbb{R}^M ; q(x) > 0\}}, $$

where $\overline{A}$ of a set $A \subset \mathbb{R}^M$ is the closure of $A$. If the support of $q(x)$ is
contained in a subspace of $\mathbb{R}^M$ whose dimension is smaller than $M$, then
$\det J = 0$. In real world problems which are defined on a high dimensional
space, the support of $q(x)$ is sometimes contained in a low dimensional subspace.
In such cases, the regular asymptotic theory does not hold. However,
such cases can be analyzed by the general theory.
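The degeneracy in Example 21 is easy to see with an empirical second-moment matrix. The sketch below assumes $M = 2$ and replaces the integral over $q(x)$ by a sample average; the data generation and function names are hypothetical choices for illustration.

```python
import random

def second_moment_matrix(xs):
    # Empirical version of J_jk = E[x_j x_k] for a list of equal-length tuples.
    n = len(xs)
    m = len(xs[0])
    return [[sum(x[j] * x[k] for x in xs) / n for k in range(m)] for j in range(m)]

def det2(J):
    # Determinant of a 2 x 2 matrix.
    return J[0][0] * J[1][1] - J[0][1] * J[1][0]

rng = random.Random(0)
# Inputs confined to the one-dimensional subspace x_2 = 2 * x_1.
degenerate = [(t, 2.0 * t) for t in (rng.gauss(0.0, 1.0) for _ in range(1000))]
# Inputs spread over the whole plane.
full = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(1000)]

if __name__ == "__main__":
    print(det2(second_moment_matrix(degenerate)))  # numerically zero
    print(det2(second_moment_matrix(full)))        # clearly positive
```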


Example 22. Let $p(x|w)$ be a statistical model and assume that the optimal
parameter is $w_0 = 0$. In statistical estimation, sometimes a prior

$$ \varphi(w) \propto |w|^{a-1} $$

is employed, where $a > 0$ is a hyperparameter. Then the regularity condition
is satisfied if and only if $a = 1$. The cases $a \ne 1$ can be analyzed by the
general theory.


4.3 Asymptotic Losses


In this section, we show asymptotic behaviors of the generalization, cross 
validation, and training losses when n — oo, based on the regularity condi- 
tion. The following lemma is necessary in this section. 


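Lemma 12. Assume the regularity condition. Let $k$ ($k \ge 2$) be an integer
and $g(x,w)$ be a function which satisfies $g(x,w_0) = 0$, and assume that, for
a sufficiently large integer $\ell$,

$$ E_X\Big[\sup_{w \in W} |g(X,w)|^\ell\Big] < \infty, $$

$$ E_X\Big[\sup_{w \in W} \|\nabla g(X,w)\|^\ell\Big] < \infty. $$

We use a notation,

$$ L(X, \alpha) = \frac{E_w\big[|g(X,w)|^k \exp(-\alpha f(X,w))\big]}{E_w\big[\exp(-\alpha f(X,w))\big]}. $$

Then, there exists $\epsilon > 0$ such that

$$ E\Big[\Big(n^{k/2} \sup_{|\alpha| \le 1} E_X[L(X, \alpha)]\Big)^{1+\epsilon}\Big] < \infty, \tag{4.12} $$

$$ E\Big[\Big(n^{k/2} \sup_{|\alpha| \le 1} \frac{1}{n}\sum_{i=1}^{n} L(X_i, \alpha)\Big)^{1+\epsilon}\Big] < \infty. \tag{4.13} $$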
The proof of this lemma is given in the following section. In this section,
we employ Lemma 12 and prove asymptotic expansions of the generalization,
cross validation, and training losses.


Remark 23. The special case $\alpha = 0$ and $g(X,w) = g(w)$ shows that
$E_w[n^{k/2}|g(w)|^k]$ is asymptotically uniformly integrable. Therefore, if the
random variable $E_w[n^{k/2}g(w)^k]$ converges in distribution, then the sequence
$E[E_w[n^{k/2}g(w)^k]]$ also converges.


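Firstly, the basic Theorem 3 in the foregoing chapter is proved in regular
cases.


Theorem 5. Based on the regularity condition, the assumptions of the basic
Theorem 3 are satisfied,

$$ \sup_{|\alpha| \le 1} \Big|\Big(\frac{d}{d\alpha}\Big)^k \mathcal{G}_n(\alpha)\Big| \le O_p\Big(\frac{1}{n^{k/2}}\Big), \tag{4.14} $$

$$ \sup_{|\alpha| \le 1} \Big|\Big(\frac{d}{d\alpha}\Big)^k \mathcal{T}_n(\alpha)\Big| \le O_p\Big(\frac{1}{n^{k/2}}\Big), \tag{4.15} $$

$$ E\Big[\sup_{|\alpha| \le 1} \Big|\Big(\frac{d}{d\alpha}\Big)^k \mathcal{G}_n(\alpha)\Big|\Big] \le O\Big(\frac{1}{n^{k/2}}\Big), \tag{4.16} $$

$$ E\Big[\sup_{|\alpha| \le 1} \Big|\Big(\frac{d}{d\alpha}\Big)^k \mathcal{T}_n(\alpha)\Big|\Big] \le O\Big(\frac{1}{n^{k/2}}\Big). \tag{4.17} $$


Proof. By applying Lemma 8 and Lemma 12 to the case $g(x, w) = f(x, w)$,
where $f(x, w)$ is the log density ratio function, eq.(4.16) and eq.(4.17) are
immediately derived. By Lemma 12, the random variables

$$ n^{k/2} \sup_{|\alpha| \le 1} E_X[L(X, \alpha)], $$

$$ n^{k/2} \sup_{|\alpha| \le 1} \frac{1}{n}\sum_{i=1}^{n} L(X_i, \alpha) $$

are asymptotically uniformly integrable, and therefore they are also uniformly
tight. Hence eq.(4.14) and eq.(4.15) are obtained. □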


Definition 11. Let $(i_1, i_2, \ldots, i_t)$ be a set of integers and $w_j$ be the $j$th
element of the vector $w$. The constant $\Phi(i_1, i_2, \ldots, i_t)$ is defined by

$$ \Phi(i_1, i_2, \ldots, i_t) = \frac{1}{(2\pi)^{d/2}} \int w_{i_1} w_{i_2} \cdots w_{i_t} \exp\Big(-\frac{\|w\|^2}{2}\Big)\,dw, $$

which is the expected value of $w_{i_1} w_{i_2} \cdots w_{i_t}$ with respect to the normal
distribution. In other words, if $X$ is subject to the $d$ dimensional normal
distribution whose average and covariance matrix are zero and identity
respectively, then

$$ \Phi(i_1, i_2, \ldots, i_t) = E_X[X_{i_1} X_{i_2} \cdots X_{i_t}]. $$
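Since the coordinates of a standard normal vector are independent, the constant of Definition 11 factorizes over coordinates, and each factor is a one-dimensional moment: zero for an odd power and $(m-1)!!$ for an even power $m$. The following sketch uses this standard fact; the function names are illustrative.

```python
from collections import Counter

def double_factorial(m):
    # m!! = m (m - 2) (m - 4) ...; equals 1 for m <= 1.
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

def phi(indices):
    # E[X_{i_1} ... X_{i_t}] for a standard normal vector X with independent coordinates.
    total = 1
    for count in Counter(indices).values():
        if count % 2 == 1:
            return 0  # an odd power of a centered normal variable has zero mean
        total *= double_factorial(count - 1)  # E[X^m] = (m - 1)!! for even m
    return total

if __name__ == "__main__":
    print(phi((1,)), phi((1, 1)), phi((1, 2)), phi((1, 1, 2, 2)), phi((1, 1, 1, 1)))
```

The cases used later in this section are $t = 1$, where the constant vanishes, and $t = 2$, where it equals the Kronecker delta.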


By using this definition, in order to derive the asymptotic behaviors of
the generalization, cross validation, and training losses, we first show the
correlations of parameters in the posterior distribution.


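Lemma 13. Let $w_0$ be the optimal parameter and $\xi_n = J^{-1/2}\nabla\eta_n(w_0)$. For
arbitrary $i_1, i_2, \ldots, i_t$, a function $\Xi_n(w)$ is defined by

$$ \Xi_n(w) = \{((nJ)^{1/2}(w - w_0) - \xi_n)_{i_1}\} \cdots \{((nJ)^{1/2}(w - w_0) - \xi_n)_{i_t}\}. $$

Then its average by the posterior distribution converges in probability,

$$ E_w[\Xi_n(w)] \to \Phi(i_1, i_2, \ldots, i_t). $$


Proof. For the essential and nonessential sets of parameters $W_1$ and $W_2$, we
define

$$ Z_n^{(1)}(i_1, \ldots, i_t) = \int_{W_1} \Xi_n(w) \exp(-nK_n(w))\,\varphi(w)\,dw, $$

$$ Z_n^{(2)}(i_1, \ldots, i_t) = \int_{W_2} \Xi_n(w) \exp(-nK_n(w))\,\varphi(w)\,dw. $$

In the same way as in Lemma 9 and 11, the convergences in probability can
be derived,

$$ n^{d/2} Z_n^{(1)}(i_1, \ldots, i_t) \exp\Big(-\frac{1}{2}\|\xi_n\|^2\Big) \to \frac{(2\pi)^{d/2}}{(\det J)^{1/2}}\,\Phi(i_1, i_2, \ldots, i_t)\,\varphi(w_0), $$

$$ n^{d/2} Z_n^{(2)}(i_1, \ldots, i_t) \exp\Big(-\frac{1}{2}\|\xi_n\|^2\Big) \to 0. $$

Note that for the case when $(i_1, i_2, \ldots, i_t)$ is the empty set, we define $Z_n^{(1)} = Z_n^{(1)}(\emptyset)$
and $Z_n^{(2)} = Z_n^{(2)}(\emptyset)$. Then

$$ E_w[\Xi_n(w)] = \frac{Z_n^{(1)}(i_1, \ldots, i_t) + Z_n^{(2)}(i_1, \ldots, i_t)}{Z_n^{(1)} + Z_n^{(2)}} \to \Phi(i_1, i_2, \ldots, i_t), $$

which completes the lemma. □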
Lemma 14. The following hold.

$$ E_w[(w - w_0)] = (nJ)^{-1/2}\xi_n + o_p(1/\sqrt{n}), $$
$$ E_w[(w - w_0)(w - w_0)^T] = \frac{1}{n}\big(J^{-1} + J^{-1/2}\xi_n\xi_n^T J^{-1/2}\big) + o_p(1/n), $$
$$ E\,E_w[(w - w_0)(w - w_0)^T] = \frac{1}{n}\big(J^{-1} + J^{-1}IJ^{-1}\big) + o(1/n). $$


Proof. By Lemma 13, if $t = 1$ then $\Phi(i_1) = 0$, hence

$$ (nJ)^{1/2}E_w[(w - w_0)] - \xi_n \to 0, \tag{4.18} $$

which shows the first equation. By Lemma 13, if $t = 2$ then $\Phi(i, j) = \delta_{ij}$, and the
convergence in probability holds,

$$ E_w\big[\{(nJ)^{1/2}(w - w_0) - \xi_n\}\{(nJ)^{1/2}(w - w_0) - \xi_n\}^T\big] \to I_d, $$

where $I_d$ is the $d \times d$ identity matrix. By eq.(4.18) and the convergence in
distribution of $\xi_n$,

$$ nJ^{1/2}E_w[(w - w_0)(w - w_0)^T]J^{1/2} - \xi_n\xi_n^T \to I_d. $$

By Lemma 12,

$$ nJ^{1/2}E\,E_w[(w - w_0)(w - w_0)^T]J^{1/2} - J^{-1/2}IJ^{-1/2} \to I_d, $$

which completes the lemma. □


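Lemma 15. The cumulants satisfy the relations,

$$ \mathcal{G}_n'(0) = -L(w_0) - \frac{d + \|\xi_n\|^2}{2n} + o_p(1/n), \tag{4.19} $$
$$ \mathcal{G}_n''(0) = \frac{\mathrm{tr}(IJ^{-1})}{n} + o_p(1/n), \tag{4.20} $$
$$ \mathcal{T}_n'(0) = -L_n(w_0) + \frac{\|\xi_n\|^2 - d}{2n} + o_p(1/n), \tag{4.21} $$
$$ \mathcal{T}_n''(0) = \frac{\mathrm{tr}(IJ^{-1})}{n} + o_p(1/n). \tag{4.22} $$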

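Proof. By the mean value theorem and Lemma 14,

$$ E_w[K(w)] = E_w\Big[\frac{1}{2}\|J^{1/2}(w - w_0)\|^2\Big] + o_p(1/n) $$
$$ = \frac{1}{2}E_w\big[\mathrm{tr}(J(w - w_0)(w - w_0)^T)\big] + o_p(1/n) $$
$$ = \frac{1}{2n}(d + \|\xi_n\|^2) + o_p(1/n). $$

Also by the mean value theorem,

$$ f(x,w) = (w - w_0) \cdot \nabla f(x, w_0) + \frac{1}{2}\mathrm{tr}\big(\nabla^2 f(x, w_0)(w - w_0)(w - w_0)^T\big) + O(\|w - w_0\|^3). $$

Hence,

$$ E_w[f(x,w)] = E_w[(w - w_0)] \cdot \nabla f(x, w_0) + \frac{1}{2}\mathrm{tr}\big(\nabla^2 f(x, w_0)E_w[(w - w_0)(w - w_0)^T]\big) + o_p(1/n) $$
$$ = \frac{1}{\sqrt{n}}(J^{-1/2}\xi_n) \cdot \nabla f(x, w_0) + \frac{1}{2n}\mathrm{tr}\big(\nabla^2 f(x, w_0)(J^{-1} + J^{-1/2}\xi_n\xi_n^T J^{-1/2})\big) + o_p(1/n) $$

and

$$ E_w[f(x,w)^2] = E_w\big[((w - w_0) \cdot \nabla f(x, w_0))^2\big] + o_p(1/n) $$
$$ = \mathrm{tr}\big(E_w[(w - w_0)(w - w_0)^T]\,\nabla f(x, w_0)(\nabla f(x, w_0))^T\big) + o_p(1/n) $$
$$ = \frac{1}{n}\mathrm{tr}\big((J^{-1} + J^{-1/2}\xi_n\xi_n^T J^{-1/2})\,\nabla f(x, w_0)(\nabla f(x, w_0))^T\big) + o_p(1/n). $$

Then by using

$$ \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \nabla f(X_i, w_0) = -J^{1/2}\xi_n, \tag{4.23} $$
$$ \frac{1}{n}\sum_{i=1}^{n} \nabla^2 f(X_i, w_0) = J + o_p(1), \tag{4.24} $$
$$ E_X[\nabla^2 f(X, w_0)] = J, \tag{4.25} $$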
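it follows that

$$ \frac{1}{n}\sum_{i=1}^{n} E_w[f(X_i, w)] = -\frac{\|\xi_n\|^2}{n} + \frac{1}{2n}\mathrm{tr}\big(J(J^{-1} + J^{-1/2}\xi_n\xi_n^T J^{-1/2})\big) + o_p(1/n) $$
$$ = \frac{d - \|\xi_n\|^2}{2n} + o_p(1/n). $$

By using the results above,

$$ \frac{1}{n}\sum_{i=1}^{n} E_w[f(X_i, w)]^2 = \frac{1}{n^2}\sum_{i=1}^{n} \big((J^{-1/2}\xi_n) \cdot \nabla f(X_i, w_0)\big)^2 + o_p(1/n) $$
$$ = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{tr}\big(J^{-1/2}\xi_n\xi_n^T J^{-1/2}\,\nabla f(X_i, w_0)(\nabla f(X_i, w_0))^T\big) + o_p(1/n) $$
$$ = \frac{1}{n}\mathrm{tr}\big(J^{-1/2}\xi_n\xi_n^T J^{-1/2} I\big) + o_p(1/n) $$

and

$$ \frac{1}{n}\sum_{i=1}^{n} E_w[f(X_i, w)^2] = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{tr}\big((J^{-1} + J^{-1/2}\xi_n\xi_n^T J^{-1/2})\,\nabla f(X_i, w_0)(\nabla f(X_i, w_0))^T\big) + o_p(1/n) $$
$$ = \frac{1}{n}\mathrm{tr}\big((J^{-1} + J^{-1/2}\xi_n\xi_n^T J^{-1/2}) I\big) + o_p(1/n). $$

Then by using eq.(3.14) through eq.(3.17), the lemma is derived. □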


Theorem 6. (Regular asymptotic theory) Assume that a true distribution
is regular for a statistical model. Then the generalization loss, the cross
validation loss, the training loss, and WAIC are asymptotically equal to

$$ G_n = L(w_0) + \frac{d + \|\xi_n\|^2 - \mathrm{tr}(IJ^{-1})}{2n} + o_p(1/n), \tag{4.26} $$
$$ C_n = L_n(w_0) + \frac{d - \|\xi_n\|^2 + \mathrm{tr}(IJ^{-1})}{2n} + o_p(1/n), \tag{4.27} $$
$$ T_n = L_n(w_0) + \frac{d - \|\xi_n\|^2 - \mathrm{tr}(IJ^{-1})}{2n} + o_p(1/n), \tag{4.28} $$
$$ W_n = L_n(w_0) + \frac{d - \|\xi_n\|^2 + \mathrm{tr}(IJ^{-1})}{2n} + o_p(1/n). \tag{4.29} $$

Proof. This theorem is obtained by combining Theorem 3 and Lemma 15.
□
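To make the structure of the expansions in Theorem 6 concrete, the leading terms can be evaluated for given values of $d$, $\|\xi_n\|^2$ and $\mathrm{tr}(IJ^{-1})$. The sketch below drops the $o_p(1/n)$ remainders, and the function and argument names are hypothetical.

```python
def regular_losses(L_w0, Ln_w0, d, xi_sq, nu, n):
    # Leading terms of the regular expansions, with nu standing for tr(I J^{-1}).
    G = L_w0 + (d + xi_sq - nu) / (2.0 * n)   # generalization loss
    C = Ln_w0 + (d - xi_sq + nu) / (2.0 * n)  # cross validation loss
    T = Ln_w0 + (d - xi_sq - nu) / (2.0 * n)  # training loss
    W = Ln_w0 + (d - xi_sq + nu) / (2.0 * n)  # WAIC
    return G, C, T, W

if __name__ == "__main__":
    G, C, T, W = regular_losses(L_w0=1.40, Ln_w0=1.38, d=3, xi_sq=2.5, nu=3.0, n=100)
    print(G, C, T, W)
    # The fluctuation ||xi_n||^2 enters G and C with opposite signs.
    print(100 * ((G - 1.40) + (C - 1.38)))  # equals d up to rounding
```

At this order the cross validation loss and WAIC coincide, and their difference from the training loss is $\mathrm{tr}(IJ^{-1})/n$.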


Theorem 7. (Expectation of regular asymptotic theory) The asymptotic
expansions of the generalization loss, the training loss, the cross validation
loss, and WAIC are given by

$$ E[G_n] = L(w_0) + \frac{d}{2n} + o(1/n), \tag{4.30} $$
$$ E[T_n] = L(w_0) + \frac{d - 2\,\mathrm{tr}(IJ^{-1})}{2n} + o(1/n), \tag{4.31} $$
$$ E[C_n] = L(w_0) + \frac{d}{2n} + o(1/n), \tag{4.32} $$
$$ E[W_n] = L(w_0) + \frac{d}{2n} + o(1/n). \tag{4.33} $$

Proof. Since $E[G_{n-1}] = E[C_n]$, $E[\|\xi_n\|^2] = \mathrm{tr}(IJ^{-1}) + o(1)$. Then by Lemma
12, we obtain Theorem 7. □


Remark 24. The equation $E[\|\xi_n\|^2] = \mathrm{tr}(IJ^{-1}) + o(1)$ can be proved by two
methods. The first proof is derived from the definition

$$ \xi_n = -\frac{1}{\sqrt{n}}\sum_{i=1}^{n} J^{-1/2}\nabla f(X_i, w_0). $$

The second is given by

$$ E[G_{n-1}] = E[C_n]. $$

If $X_1, X_2, \ldots, X_n$ are independent, then both methods can be used. However,
if they are not independent, then the second method cannot be used. In such
a case, the first proof can be applied.


Remark 25. (1) By the theorem, the convergence in probability holds,

$$ n(G_n - L(w_0)) + n(C_n - L_n(w_0)) \to d, $$

where $d$ is the dimension of the parameter. That is to say, for a given triple
$(q(x), p(x|w), \varphi(w))$, if $(C_n - L_n(w_0))$ is smaller then $(G_n - L(w_0))$ is larger.
(2) In regular asymptotic theory, the functional variance $V_n$ is given by

$$ V_n = \frac{1}{n}\sum_{i=1}^{n} V_w[\log p(X_i|w)] = \frac{\mathrm{tr}(IJ^{-1})}{n} + o_p(1/n). $$
4.4 Proof of Asymptotic Expansions


In this section, we prove Lemma 12, which is the basis of the asymptotic
expansion. The random variable we study is

$$ Y = n^{k/2} \sup_{|\alpha| \le 1} E_X\Big[\frac{E_w[|g(X,w)|^k \exp(-\alpha f(X,w))]}{E_w[\exp(-\alpha f(X,w))]}\Big]. $$

Then it is sufficient to prove that $E[|Y|^{1+\epsilon}] < \infty$ for an arbitrary $\epsilon > 0$. Let
$W_1 = \{w; \|w - w_0\| < n^{-2/5}\}$ and $W_2 = W \setminus W_1$, and

$$ Y_1 = E_w[\exp(-\alpha f(X,w))]_{\{W_1\}}, \tag{4.34} $$
$$ Y_2 = E_w[\exp(-\alpha f(X,w))]_{\{W_2\}}, \tag{4.35} $$
$$ Y_3 = n^{k/2} E_w[|g(X,w)|^k \exp(-\alpha f(X,w))]_{\{W_1\}}, \tag{4.36} $$
$$ Y_4 = n^{k/2} E_w[|g(X,w)|^k \exp(-\alpha f(X,w))]_{\{W_2\}}, \tag{4.37} $$

where $E[f(X)]_{\{S\}}$ is the expected value restricted to a set $S$. Then $Y$
can be rewritten as

$$ Y = \sup_{|\alpha| \le 1} E_X\Big[\frac{Y_3 + Y_4}{Y_1 + Y_2}\Big]. $$

Since $Y_1, Y_2, Y_3, Y_4 > 0$,

$$ \frac{Y_3 + Y_4}{Y_1 + Y_2} \le \frac{Y_3}{Y_1} + \min\Big\{\frac{Y_4}{Y_1}, \frac{Y_4}{Y_2}\Big\}. $$

Firstly, let us study $Y_3/Y_1$.

Lemma 16. Let $a, c, D, n > 0$ and $d \ge 0$ be arbitrary real constants, and
$f_1(u)$ and $f_2(u)$ be real-valued continuous functions of $u \in [0, D] \subset \mathbb{R}$ which
are differentiable in $(0, D)$. For a nonnegative integer $m \ge 0$, $Z_m$ is defined
by

$$ Z_m = \int_0^D (au^{2c})^{m/2} u^d \exp\big(-nau^{2c} + \sqrt{na}\,u^c f_1(u) + f_2(u)\big)\,du. $$

Then

$$ \frac{Z_m}{Z_0} \le \frac{2^{m/2}A^m + 2^m B^{m/2}}{n^{m/2}}, $$

where, by using a definition $q = (m-2)c + d + 1$,

$$ A = (c\sup|f_1(u)| + D\sup|f_1'(u)|)/(2c), \tag{4.38} $$
$$ B = (q + D\sup|f_2'(u)|)/(2c). \tag{4.39} $$
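Proof. For the proof we define

$$ H(u) = \exp\big(\sqrt{na}\,u^c f_1(u) + f_2(u)\big). \tag{4.40} $$

Note that $(2c - 1) + q = cm + d$. By using integration by parts,

$$ Z_m = a^{m/2}\int_0^D \{\exp(-nau^{2c})\,u^{2c-1}\}\,u^q H(u)\,du $$
$$ = -\frac{a^{m/2-1}}{2cn}\int_0^D \partial_u\{\exp(-nau^{2c})\}\{u^q H(u)\}\,du $$
$$ = -\frac{a^{m/2-1}}{2cn}\Big[\exp(-nau^{2c})\{u^q H(u)\}\Big]_0^D + \frac{a^{m/2-1}}{2cn}\int_0^D \{\exp(-nau^{2c})\}\,\partial_u\{u^q H(u)\}\,du $$
$$ \le \frac{a^{m/2-1}}{2cn}\int_0^D \exp(-nau^{2c})\,\partial_u\{u^q H(u)\}\,du. \tag{4.41} $$

Then by the assumption,

$$ \partial_u\{u^q H(u)\} = u^{q-1}H(u)\{q + uf_2'(u) + c\sqrt{na}\,u^c f_1(u) + \sqrt{na}\,u^{c+1}f_1'(u)\} $$
$$ \le u^{q-1}H(u)\{q + D\sup|f_2'| + \sqrt{na}\,u^c(c\sup|f_1| + D\sup|f_1'|)\}. \tag{4.42} $$

By applying eq.(4.42) to eq.(4.41) and using the definitions of $Z_{m-2}$ and $Z_{m-1}$,

$$ Z_m \le \frac{Z_{m-2}}{2cn}(q + D\sup|f_2'|) + \frac{Z_{m-1}}{2c\sqrt{n}}(c\sup|f_1| + D\sup|f_1'|). $$

By using the definitions of eq.(4.38) and eq.(4.39), we obtain an inequality,

$$ \frac{Z_m}{Z_0} \le \frac{A}{\sqrt{n}}\frac{Z_{m-1}}{Z_0} + \frac{B}{n}\frac{Z_{m-2}}{Z_0}. $$

By the Cauchy-Schwarz inequality,

$$ \frac{Z_{m-1}}{Z_0} \le \Big(\frac{Z_m}{Z_0}\cdot\frac{Z_{m-2}}{Z_0}\Big)^{1/2}. $$

Hence

$$ \frac{Z_m}{Z_0} \le \frac{A}{\sqrt{n}}\Big(\frac{Z_m}{Z_0}\cdot\frac{Z_{m-2}}{Z_0}\Big)^{1/2} + \frac{B}{n}\frac{Z_{m-2}}{Z_0} $$
$$ \le \frac{1}{2}\Big(\frac{Z_m}{Z_0} + \frac{A^2}{n}\frac{Z_{m-2}}{Z_0}\Big) + \frac{B}{n}\frac{Z_{m-2}}{Z_0}. $$

It follows that

$$ \frac{Z_m}{Z_0} \le \frac{A^2 + 2B}{n}\frac{Z_{m-2}}{Z_0}. $$

Hence

$$ \frac{Z_m}{Z_0} \le \frac{(A^2 + 2B)^{m/2}}{n^{m/2}} \le \frac{2^{m/2}A^m + 2^m B^{m/2}}{n^{m/2}}, $$

which completes the lemma.
□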


Lemma 17. Let $Y_1$ and $Y_3$ be random variables defined by eq.(4.34) and
eq.(4.36) respectively. We define

$$ T(X) = \sup_{w \in W_1} \|J^{-1/2}\nabla g(X,w)\|. $$

Then

$$ \frac{Y_3}{Y_1} \le T(X)^k\,(2^k A^{2k} + 2^{2k} B^k), $$

where

$$ A \le c_1\|\nabla\eta_n(w_0)\|, \tag{4.43} $$
$$ B \le q + c_2 n^{-2/5}\big(\sup|\nabla_w f(X, w)| + \sup|\nabla_w \delta_n(w)|\big). \tag{4.44} $$

Here $q = (2k - 2) + d$, and $c_1, c_2 > 0$ are constants determined by
$J$. Hence $E[(Y_3/Y_1)^{1+\epsilon}] < \infty$.


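Proof. By the definitions,

$$ Y_1 \propto \int_{W_1} \exp(-nK_n(w) - \alpha f(X, w))\,\varphi(w)\,dw, $$

where

$$ nK_n(w) = \frac{n}{2}\|J^{1/2}(w - w_0)\|^2 - \sqrt{n}\,\nabla\eta_n(w_0) \cdot (w - w_0) + \delta_n(w). $$

Let $J^{1/2}(w - w_0) = r\theta$, where $(r, \theta)$ is the generalized polar coordinate with
$|\theta| = 1$. Then

$$ \|J^{1/2}(w - w_0)\|^2 = r^2. $$

By

$$ T(X) = \sup_{w \in W_1} \|J^{-1/2}\nabla g(X,w)\| $$

it follows that

$$ |g(X, w)| \le rT(X). $$

Hence

$$ \frac{Y_3}{Y_1} \le T(X)^k\,\frac{\int_{W_1} (nr^2)^{k/2} \exp\big(-nr^2/2 + \sqrt{n}\,r f_1(\theta) + f_2(r, \theta)\big)\,r^{d-1}\,dr\,d\theta}{\int_{W_1} \exp\big(-nr^2/2 + \sqrt{n}\,r f_1(\theta) + f_2(r, \theta)\big)\,r^{d-1}\,dr\,d\theta}, $$

where

$$ f_1(\theta) = J^{-1/2}\nabla\eta_n(w_0) \cdot \theta, $$
$$ f_2(r, \theta) = -\alpha f(X, r, \theta) + \log\varphi(r, \theta) - \delta_n(r, \theta). $$

By applying Lemma 16,

$$ \frac{Y_3}{Y_1} \le T(X)^k\,(2^k A^{2k} + 2^{2k} B^k), $$

where, by using $q = (2k - 2) + d$ and $D = \|J^{1/2}\|\,n^{-2/5}$,

$$ A = \sup_{\theta} |f_1(\theta)|/2, $$
$$ B = \big(q + D \sup_{(r,\theta)} |\partial_r f_2(r, \theta)|\big)/2, $$

where we used the inequality

$$ \Big|\frac{\partial f}{\partial r}\Big| = \Big|\frac{\partial f}{\partial w} \cdot \frac{\partial w}{\partial r}\Big| \le \|\nabla f\|\,\|J^{-1/2}\|. $$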

Lemma 18. Let

$$ F(X) = \sup_{w \in W} |f(X,w)|, \quad G(X) = \sup_{w \in W} |g(X, w)|. $$

Then

$$ \frac{Y_4}{Y_1} \le G(X)^k \exp(2\alpha F(X)) \exp\big(-(J_1/4)\,n^{1/5} + (k/2)\log n + \eta_n^2\big). $$

Proof. By the definitions,

$$ Y_1 \ge \exp(-\alpha F(X))\,Z_n^{(1)}, $$
$$ Y_4 \le n^{k/2}\,G(X)^k \exp(\alpha F(X))\,Z_n^{(2)}. $$

Hence

$$ Y_4/Y_1 \le n^{k/2}\,G(X)^k \exp(2\alpha F(X))\,Z_n^{(2)}/Z_n^{(1)} $$
$$ \le n^{k/2}\,G(X)^k \exp(2\alpha F(X)) \exp\big(-(J_1/4)\,n^{1/5} + \eta_n^2\big). $$

On the other hand,

$$ Y_4/Y_2 \le n^{k/2}\,G(X)^k $$

is derived from the definition of $G(X)$. □


Lemma 19. Let $A_n$ and $B_n$ be random variables. Assume that

$$ M = \sup_n E[(|A_n| + |B_n|)^{m+\ell}] < \infty $$

and that $\{a_n > 0\}$ is an increasing sequence of real values. Then

$$ E\big[|A_n|^m \min\{\exp(A_n + B_n - a_n), n^{m/2}\}\big] \le E[|A_n|^m] + Mn^{m/2}/(a_n)^\ell. \tag{4.45} $$


Proof. By the assumption,

$$ M \ge E[(|A_n| + |B_n|)^{m+\ell}] $$

is finite. Hence

$$ M \ge (a_n)^\ell\,E[(|A_n| + |B_n|)^m]_{\{|A_n| + |B_n| \ge a_n\}}. $$

Let $E_n$ be the left hand side of eq.(4.45). Then

$$ E_n \le E[|A_n|^m]_{\{|A_n| + |B_n| < a_n\}} + E[n^{m/2}|A_n|^m]_{\{|A_n| + |B_n| \ge a_n\}} $$
$$ \le E[|A_n|^m] + Mn^{m/2}/(a_n)^\ell, $$

which completes the lemma. □


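Let us prove Lemma 12.


Proof. (Proof of Lemma 12). By Lemma 17, $E[|Y_3/Y_1|^{1+\epsilon}] < \infty$. In Lemma
19, by putting $m = k + \epsilon$, $A_n = \sup_w |f(X,w)|$, $B_n = \eta_n$, $a_n = (J_1/4)\,n^{1/5} - (k/2)\log n$,
and $\ell = 5m/2$,

$$ E\Big[\Big(\min\Big\{\frac{Y_4}{Y_1}, \frac{Y_4}{Y_2}\Big\}\Big)^{1+\epsilon}\Big] < \infty, $$

which completes the lemma. □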
4.5 Point Estimators


In this section, we study other statistical estimation methods when a true 
distribution is regular for a statistical model. 


Definition 12. The maximum likelihood estimator $w_{ML}$, the maximum
a posteriori estimator $w_{MAP}$, and the posterior mean estimator $w_{PM}$ are
defined by

$$ w_{ML} = \arg\max_w \prod_{i=1}^{n} p(X_i|w), \tag{4.46} $$
$$ w_{MAP} = \arg\max_w \varphi(w)\prod_{i=1}^{n} p(X_i|w), \tag{4.47} $$
$$ w_{PM} = E_w[w], \tag{4.48} $$

where "$\arg\max f(w)$" is the parameter that maximizes $f(w)$. The
procedures in which a true distribution $q(x)$ is estimated by $p(x|w_{ML})$,
$p(x|w_{MAP})$, and $p(x|w_{PM})$ are respectively called the maximum likelihood,
maximum a posteriori, and posterior mean methods.


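If a true distribution is regular for a statistical model, by using

$$ nK_n(w) = \frac{n}{2}\|J^{1/2}(w - w_0)\|^2 - \sqrt{n}\,\nabla\eta_n(w_0) \cdot (w - w_0) + o_p(1), $$

these three estimators are asymptotically equivalent,

$$ w_{ML} = w_0 + \frac{J^{-1/2}\xi_n}{\sqrt{n}} + o_p\Big(\frac{1}{\sqrt{n}}\Big), $$
$$ w_{MAP} = w_0 + \frac{J^{-1/2}\xi_n}{\sqrt{n}} + o_p\Big(\frac{1}{\sqrt{n}}\Big), $$
$$ w_{PM} = w_0 + \frac{J^{-1/2}\xi_n}{\sqrt{n}} + o_p\Big(\frac{1}{\sqrt{n}}\Big). $$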
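The asymptotic equivalence of the three estimators can be seen in a case with closed forms. The sketch below assumes the one-dimensional model $p(x|w) = N(x; w, 1)$ with prior $N(0, \tau^2)$, for which the posterior is normal, so that the MAP and posterior mean estimators coincide; the setting and names are illustrative assumptions.

```python
import random

def point_estimators(xs, tau2):
    # Closed forms for p(x|w) = N(x; w, 1) with prior N(w; 0, tau2).
    n = len(xs)
    mean = sum(xs) / n
    w_ml = mean                         # maximizes the likelihood
    w_map = n * mean / (n + 1.0 / tau2) # mode of the normal posterior
    w_pm = w_map                        # mean of the normal posterior
    return w_ml, w_map, w_pm

def sample(n, seed=0):
    # Draw n points from the assumed true density N(0, 1).
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

if __name__ == "__main__":
    for n in (10, 1000, 100000):
        w_ml, w_map, w_pm = point_estimators(sample(n), 1.0)
        # sqrt(n) times the gap tends to zero, i.e. the gap is o_p(1/sqrt(n)).
        print(n, (n ** 0.5) * abs(w_ml - w_map))
```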


Remark 26. (1) Even if the same prior distribution is employed, $p(x|w_{MAP})$
and $p(x|w_{PM})$ are different from the Bayesian estimation $E_w[p(x|w)]$. In
some books and papers, $p(x|w_{MAP})$ and $p(x|w_{PM})$ may be called 'Bayesian
estimation' and $E_w[p(x|w)]$ is called 'fully Bayesian estimation'.

(2) If a true distribution is not regular for a statistical model, then $w_{ML}$,
$w_{MAP}$, and $w_{PM}$ are not equivalent even asymptotically.
Lemma 20. The generalization and training losses of the maximum likelihood
method are defined by

$$ G_n(ML) = L(w_{ML}), \tag{4.49} $$
$$ T_n(ML) = L_n(w_{ML}). \tag{4.50} $$

If a true distribution is regular for a statistical model,

$$ G_n(ML) = L(w_0) + \frac{\|\xi_n\|^2}{2n} + o_p\Big(\frac{1}{n}\Big), \tag{4.51} $$
$$ T_n(ML) = L_n(w_0) - \frac{\|\xi_n\|^2}{2n} + o_p\Big(\frac{1}{n}\Big). \tag{4.52} $$

The generalization and training losses of the maximum a posteriori method
and posterior mean method are asymptotically equivalent to those of the
maximum likelihood method.


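Proof. The generalization and training losses by the maximum likelihood
method are given by

$$ G_n(ML) = L(w_0) + K(w_{ML}), \tag{4.53} $$
$$ T_n(ML) = L_n(w_0) + K_n(w_{ML}). \tag{4.54} $$

By $J = \nabla^2 K(w_0)$,

$$ K(w_{ML}) = \frac{1}{2}(w_0 - w_{ML}) \cdot \nabla^2 K(w_0)(w_0 - w_{ML}) + o_p\Big(\frac{1}{n}\Big) = \frac{\|\xi_n\|^2}{2n} + o_p\Big(\frac{1}{n}\Big), $$

and by $\nabla K_n(w_0) = -(1/\sqrt{n})\nabla\eta_n(w_0) = -(1/\sqrt{n})J^{1/2}\xi_n$ and

$$ \nabla^2 K_n(w_0) = \nabla^2 K(w_0) + o_p(1), $$

it follows that

$$ K_n(w_{ML}) = (w_{ML} - w_0) \cdot \nabla K_n(w_0) + \frac{1}{2}(w_0 - w_{ML}) \cdot \nabla^2 K_n(w_0)(w_0 - w_{ML}) + o_p\Big(\frac{1}{n}\Big) $$
$$ = -\frac{\|\xi_n\|^2}{2n} + o_p\Big(\frac{1}{n}\Big). $$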
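The two excess losses of the maximum likelihood method can be checked exactly in a quadratic case. The sketch below assumes the one-dimensional model $p(x|w) = N(x; w, 1)$ with true density $N(0,1)$, so that $f(x,w) = (w^2 - 2wx)/2$, $K(w) = w^2/2$, $w_{ML} = \bar{X}$ and $\xi_n = \sqrt{n}\,\bar{X}$; these choices and the function names are illustrative assumptions.

```python
import random

def f(x, w):
    # Log density ratio log p(x|0) - log p(x|w) for p(x|w) = N(x; w, 1).
    return 0.5 * (w * w - 2.0 * w * x)

def K(w):
    # Average of f(X, w) under the assumed true density N(0, 1).
    return 0.5 * w * w

def K_n(xs, w):
    # Empirical average of f(X_i, w).
    return sum(f(x, w) for x in xs) / len(xs)

def ml_excess_losses(xs):
    n = len(xs)
    w_ml = sum(xs) / n            # maximum likelihood estimator
    xi_sq = n * w_ml * w_ml       # ||xi_n||^2 with xi_n = sqrt(n) * sample mean
    return K(w_ml), K_n(xs, w_ml), xi_sq / (2.0 * n)

def sample(n, seed=0):
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

if __name__ == "__main__":
    g_excess, t_excess, ref = ml_excess_losses(sample(1000))
    print(g_excess, t_excess, ref)  # g_excess = ref and t_excess = -ref
```

Because $K$ and $K_n$ are exactly quadratic here, the $o_p(1/n)$ terms vanish and the generalization and training losses deviate from $L(w_0)$ and $L_n(w_0)$ by the same amount in opposite directions.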
The generalization and training losses of the maximum a posteriori
and the posterior mean methods have the same asymptotic behaviors as
those of the maximum likelihood method. If a true distribution is realizable by a
statistical model, then $I = J$, so that these asymptotic losses are
equivalent to those of Bayesian estimation in Theorem 6.


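Remark 27. (Deviance and functional variance) In Bayesian estimation, the
deviance $Dev$ and functional variance $V$ are defined by

$$ Dev = -2\Big\{\frac{1}{n}\sum_{i=1}^{n} E_w[\log p(X_i|w)] - \frac{1}{n}\sum_{i=1}^{n} \log p(X_i|E_w[w])\Big\}, $$

$$ V = \frac{1}{n}\sum_{i=1}^{n} \big\{E_w[(\log p(X_i|w))^2] - E_w[\log p(X_i|w)]^2\big\}. $$

If a true distribution is regular for a statistical model and if $n$ is sufficiently
large,

$$ Dev = 2\{E_w[K_n(w)] - K_n(E_w[w])\} $$
$$ = 2\Big\{\frac{d - \|\xi_n\|^2}{2n} + \frac{\|\xi_n\|^2}{2n}\Big\} + o_p\Big(\frac{1}{n}\Big) $$
$$ = \frac{d}{n} + o_p\Big(\frac{1}{n}\Big), \tag{4.55} $$

and

$$ V = \mathcal{T}_n''(0) = \frac{\mathrm{tr}(IJ^{-1})}{n} + o_p\Big(\frac{1}{n}\Big). \tag{4.56} $$

Note that eq.(4.55) and eq.(4.56) hold even if a true distribution is not
realizable by a statistical model.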
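The deviance and the functional variance of Remark 27 can be computed in closed form when the posterior is normal. The sketch below assumes the one-dimensional model $p(x|w) = N(x; w, 1)$, prior $N(0, \tau^2)$ and true density $N(0,1)$, for which $d = 1$ and $\mathrm{tr}(IJ^{-1}) = 1$; the setting and function names are assumptions made for illustration.

```python
import random

def posterior_normal_mean(xs, tau2):
    # Posterior of w is normal with these mean and variance.
    n = len(xs)
    prec = n + 1.0 / tau2
    return sum(xs) / prec, 1.0 / prec

def deviance(xs, tau2):
    # Dev = 2 {E_w[K_n(w)] - K_n(E_w[w])} with K_n(w) = w^2 / 2 - w * mean(xs).
    n = len(xs)
    mean = sum(xs) / n
    mu, var = posterior_normal_mean(xs, tau2)
    avg_Kn = 0.5 * (mu * mu + var) - mu * mean
    Kn_at_mean = 0.5 * mu * mu - mu * mean
    return 2.0 * (avg_Kn - Kn_at_mean)

def functional_variance(xs, tau2):
    # V = (1/n) sum_i Var_w[log p(X_i|w)]; for w ~ N(mu, var),
    # Var[(X_i - w)^2 / 2] = (X_i - mu)^2 * var + var^2 / 2.
    n = len(xs)
    mu, var = posterior_normal_mean(xs, tau2)
    return sum((x - mu) ** 2 * var + 0.5 * var * var for x in xs) / n

def sample(n, seed=1):
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

if __name__ == "__main__":
    xs = sample(5000)
    n = len(xs)
    print(n * deviance(xs, 1.0), n * functional_variance(xs, 1.0))  # both near 1
```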


In practical applications, we do not know the true distribution, hence
the optimal parameter $w_0$ is unknown. Let $\hat{w}$ be the MAP estimator. Let
us introduce a numerical calculation method for the free energy,

$$ F_n = -\log\int \exp(-\mathcal{L}(w))\,dw. $$

Here $\mathcal{L}(w)$ is the sum of the minus log likelihood function and the minus log prior,

$$ \mathcal{L}(w) = nL_n(w) - \log\varphi(w). $$

By the regularity condition,

$$ nL_n(w) \approx nL_n(\hat{w}) + \frac{n}{2}(w - \hat{w})^T J_n (w - \hat{w}) + o_p(1), $$

where we used a notation, $J_n = \nabla^2 L_n(\hat{w})$. It follows that

$$ F_n = nL_n(\hat{w}) + \frac{d}{2}\log n - \frac{d}{2}\log(2\pi) + \frac{1}{2}\log\det J_n - \log\varphi(\hat{w}) + o_p(1). \tag{4.57} $$

The information criterion BIC is defined by

$$ \mathrm{BIC} = nL_n(\hat{w}) + \frac{d}{2}\log n. \tag{4.58} $$
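The asymptotic form of the free energy and BIC differ only by terms that do not grow with $n$. The sketch below assumes that eq.(4.57) reads $nL_n(\hat{w}) + (d/2)\log n - (d/2)\log(2\pi) + (1/2)\log\det J_n - \log\varphi(\hat{w})$ and that BIC takes the standard form $nL_n(\hat{w}) + (d/2)\log n$; the function and argument names are hypothetical.

```python
import math

def laplace_free_energy(n_Ln_hat, n, d, log_det_Jn, log_prior_hat):
    # Asymptotic form of the free energy around the MAP estimator, without o_p(1).
    return (n_Ln_hat + 0.5 * d * math.log(n) - 0.5 * d * math.log(2.0 * math.pi)
            + 0.5 * log_det_Jn - log_prior_hat)

def bic(n_Ln_hat, n, d):
    # BIC keeps only the terms that grow with n.
    return n_Ln_hat + 0.5 * d * math.log(n)

if __name__ == "__main__":
    for n in (10, 100, 1000):
        n_Ln_hat = 1.4 * n  # placeholder value of n * L_n at the MAP estimator
        gap = laplace_free_energy(n_Ln_hat, n, 2, 0.3, -1.2) - bic(n_Ln_hat, n, 2)
        print(n, gap)  # the same constant for every n
```

The gap is the constant $-(d/2)\log(2\pi) + (1/2)\log\det J_n - \log\varphi(\hat{w})$, which is why the difference between the free energy and BIC is described as a constant order term.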

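The cross validation and training losses are

$$ C_n \approx L_n(\hat{w}) + \frac{d + \mathrm{tr}(I_n J_n^{-1})}{2n} + o_p(1/n), \tag{4.59} $$
$$ T_n \approx L_n(\hat{w}) + \frac{d - \mathrm{tr}(I_n J_n^{-1})}{2n} + o_p(1/n), \tag{4.60} $$

where

$$ I_n = \frac{1}{n}\sum_{i=1}^{n} \nabla\log p(X_i|\hat{w})\,(\nabla\log p(X_i|\hat{w}))^T. $$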


4.6 Problems 


1. Let $p(x|m,s)$ and $\varphi(m, s|\phi)$ be a statistical model and a prior given by
eq.(2.1) and eq.(2.2), respectively. Show that the MAP estimator $(\hat{m}, \hat{s})$ is


m= 2/3; 

§ = 43/(¢163 — 3), 
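where $\hat{\phi}_1$, $\hat{\phi}_2$, and $\hat{\phi}_3$ are given by eq.(2.6), (2.7), and (2.8). The log loss
function is given by

$$ L_n(m, s) = \frac{1}{2}\log(2\pi) - \frac{1}{2}\log s + \frac{s}{2n}\sum_{i=1}^{n} (X_i - m)^2. $$

Show that the matrix $I_n(m,s)$ is given by

$$ (I_n)_{11}(m,s) = \frac{s^2}{n}\sum_{i=1}^{n} (X_i - m)^2, $$
$$ (I_n)_{12}(m,s) = \frac{1}{2n}\sum_{i=1}^{n} (1 - s(X_i - m)^2)(X_i - m), $$
$$ (I_n)_{22}(m,s) = \frac{1}{4n}\sum_{i=1}^{n} (1/s - (X_i - m)^2)^2, $$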
[Figure 4.1 plot omitted: curves for F - n*Sn, asymptotic F - n*Sn, and BIC - n*Sn; horizontal axis: sample size n.]

Figure 4.1: Free energy and its asymptotic form in normal distribution.
Values $F_n - nS_n$ are compared with the asymptotic form and BIC. In a
normal distribution, $F_n - nS_n$ can be approximated by its asymptotic form
when $n > 10$. The difference between $F_n$ and BIC is a constant order term.


and $(I_n)_{21}(m, s) = (I_n)_{12}(m,s)$. Also the matrix $J_n(m,s)$ is given by

$$ (J_n)_{11}(m, s) = s, $$
$$ (J_n)_{12}(m, s) = m - \frac{1}{n}\sum_{i=1}^{n} X_i, $$
$$ (J_n)_{22}(m, s) = 1/(2s^2), $$

and $(J_n)_{21}(m, s) = (J_n)_{12}(m, s)$. A true parameter $w_0$ and a hyperparameter
$\phi$ are determined as in Example 5. The free energy or the minus log marginal
likelihood $F_n$ and the empirical entropy $S_n$ are given by eq.(2.9) and $L_n(w_0)$,
respectively. The asymptotic form of $F_n$ and BIC are given by eq.(4.57)
and eq.(4.58), respectively. In Figure 4.1, $F_n - nS_n$ is compared with its
asymptotic form and BIC, by using numerical calculation. In a normal
distribution, the free energy can be approximated by its asymptotic form
when $n > 10$. The difference between $F_n$ and BIC is a constant order term.
In Figure 4.2, $\mathrm{tr}(I_n(\hat{w})J_n^{-1}(\hat{w}))$ is compared with $nV$ and $n(C_n - T_n)$, where
[Figure 4.2 plot omitted: curves for tr(I J^{-1}), n x V, and n x (CV - TE); horizontal axis: sample size n.]

Figure 4.2: Comparison of $\mathrm{tr}(IJ^{-1})$ with $V$ in normal distribution. Values
$\mathrm{tr}(I_n(\hat{w})J_n^{-1}(\hat{w}))$ are compared with $nV$ and $n(CV - TE)$, where $V$ is the
functional variance and $CV - TE$ is the difference between the cross validation
error and the training error in a simple normal distribution.


$V$ and $C_n - T_n$ are the functional variance and the difference between the
cross validation and training losses. In this case, the variance of $n(C_n - T_n)$
is larger than the others. The values $n(C_n - T_n)$ and $nV$ are approximated
by $\mathrm{tr}(I_n J_n^{-1})$ when $n > 40$. From the numerical point of view, the precise
approximation of the free energy does not ensure that of the generalization
loss.
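The matrices of Problem 1 can be checked numerically before attempting the derivation. The sketch below assumes the normal model with mean $m$ and precision $s$, so that $L_n(m,s) = \frac{1}{2}\log(2\pi) - \frac{1}{2}\log s + \frac{s}{2n}\sum_i (X_i - m)^2$, and compares closed forms of $J_n$ with finite differences and of $I_n$ with averaged gradient products; all function names are illustrative.

```python
import math

def log_loss(xs, m, s):
    # L_n(m, s) for a normal density with mean m and precision s.
    n = len(xs)
    return (0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(s)
            + s / (2.0 * n) * sum((x - m) ** 2 for x in xs))

def J_matrix(xs, m, s):
    # Closed-form Hessian of L_n with respect to (m, s).
    j12 = m - sum(xs) / len(xs)
    return [[s, j12], [j12, 1.0 / (2.0 * s * s)]]

def I_matrix(xs, m, s):
    # Closed-form average of outer products of the per-sample gradients.
    n = len(xs)
    i11 = s * s / n * sum((x - m) ** 2 for x in xs)
    i12 = sum((1.0 - s * (x - m) ** 2) * (x - m) for x in xs) / (2.0 * n)
    i22 = sum((1.0 / s - (x - m) ** 2) ** 2 for x in xs) / (4.0 * n)
    return [[i11, i12], [i12, i22]]

def numerical_hessian(xs, m, s, h=1e-4):
    # Central finite differences of L_n, used only as a check of J_matrix.
    L = lambda a, b: log_loss(xs, a, b)
    d_mm = (L(m + h, s) - 2.0 * L(m, s) + L(m - h, s)) / (h * h)
    d_ss = (L(m, s + h) - 2.0 * L(m, s) + L(m, s - h)) / (h * h)
    d_ms = (L(m + h, s + h) - L(m + h, s - h)
            - L(m - h, s + h) + L(m - h, s - h)) / (4.0 * h * h)
    return [[d_mm, d_ms], [d_ms, d_ss]]

if __name__ == "__main__":
    data = [0.3, -1.2, 0.8, 2.1, -0.4]
    print(J_matrix(data, 0.2, 1.5))
    print(numerical_hessian(data, 0.2, 1.5))
    print(I_matrix(data, 0.2, 1.5))
```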


2. A statistical model is defined by

$$ p(x|a, b) = (1 - a)N(x) + aN(x - b), $$

where $N(x)$ is the probability density of the standard normal distribution.
A true density is set as $q(x) = p(x|0.5, 1)$, and the prior for $(a, b)$
is the uniform distribution on $[0,1] \times [0,2]$. Then the true distribution is realizable by and
regular for the statistical model, and the maximum a posteriori estimator is
equal to the maximum likelihood estimator. The free energy or the minus
[Figure 4.3 plot omitted: curves for F - n*Sn, asymptotic F - n*Sn, and BIC - n*Sn; horizontal axis: sample size n.]

Figure 4.3: Free energy and its asymptotic form in a normal mixture. Values
$F_n - nS_n$ are compared with its asymptotic form and BIC. In a normal
mixture, $F_n - nS_n$ can be approximated by its asymptotic form when $n > 200$.
The difference between $F_n$ and BIC is a constant order term.


log marginal likelihood F,, and the empirical entropy are given by eq.(2.9) 
and L,,(wo), respectively. The asymptotic form of F;, and BIC are given by 
eq.(4.57) and eq.(4.58), respectively. In this experiment, the integration of a 
function f(a,b) over the parameter (a,b) is performed by the Riemann sum 


N N 


1 2 
[af f(a,6)dadb = => > fGj/N —1/2,2k/N — 1). 
0 0 7k 


1k=1 
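The Riemann-sum integration over $(a,b)$ described above can be sketched in Python as follows. This is a minimal illustration, not the code used for the experiment: the midpoint placement of the grid nodes, the function name `riemann_integral`, and the test integrand are assumptions.

```python
def riemann_integral(f, n_grid):
    """Approximate the integral of f(a, b) over [0, 1] x [0, 2].

    Uses an n_grid x n_grid grid of cell midpoints (an assumed node
    placement); each cell has area (1 / n_grid) * (2 / n_grid).
    """
    cell_area = 2.0 / (n_grid * n_grid)
    total = 0.0
    for j in range(1, n_grid + 1):
        a = (j - 0.5) / n_grid          # midpoint of the j-th cell in [0, 1]
        for k in range(1, n_grid + 1):
            b = (2 * k - 1) / n_grid    # midpoint of the k-th cell in [0, 2]
            total += f(a, b)
    return cell_area * total


if __name__ == "__main__":
    # Hypothetical test integrand: the exact integral of a * b is 1.
    print(riemann_integral(lambda a, b: a * b, 100))
```

In the experiment the integrand would be the unnormalized posterior, whose values can underflow for large $n$, so in practice one would likely accumulate the sum in log scale.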


In Figure 4.3, $F_n - nS_n$ is compared with its asymptotic form and BIC by
numerical calculation. In a normal mixture, the free energy can be
approximated by its asymptotic form when $n > 200$. The difference between
$F_n$ and BIC is a constant order term. In Figure 4.4, $\mathrm{tr}(I_n(\hat{w})J_n^{-1}(\hat{w}))$ is
numerically compared with $nV$, where $V$ is the functional variance. The horizontal
axis shows the sample size, and the vertical axis shows the average and standard
deviation over 100 sample sets $X^n$. The true parameter is a
regular point for the statistical model.

[Figure 4.4 plot. Curves: tr(IJ^{-1}), n x V, n x (CV-TE). Horizontal axis: n, sample size (0 to 600).]

Figure 4.4: Comparison of $\mathrm{tr}(IJ^{-1})$ with $V$ in normal mixture. Values of
$\mathrm{tr}(I_n(\hat{w})J_n^{-1}(\hat{w}))$ are compared with $nV$ and $n(CV - TE)$, where $V$ is the
functional variance, $CV$ is the cross validation error, and $TE$ is the training
error in a normal mixture.

Let $M(x) = N(x) - N(x-b)$ and $p(x) = p(x|a,b)$. Show that

\[ (I_n)_{11}(a,b) = \frac{1}{n}\sum_{i=1}^{n} M(X_i)^2/p(X_i)^2, \]
\[ (I_n)_{12}(a,b) = \frac{1}{n}\sum_{i=1}^{n} aM(X_i)N'(X_i-b)/p(X_i)^2, \]
\[ (I_n)_{22}(a,b) = \frac{1}{n}\sum_{i=1}^{n} a^2N'(X_i-b)^2/p(X_i)^2, \]

and $(I_n)_{21}(a,b) = (I_n)_{12}(a,b)$. Also show that the matrix $J_n(a,b)$ is given
by

\[ (J_n)_{11}(a,b) = \frac{1}{n}\sum_{i=1}^{n} M(X_i)^2/p(X_i)^2, \]
\[ (J_n)_{12}(a,b) = \frac{1}{n}\sum_{i=1}^{n} N'(X_i-b)\{p(X_i) + aM(X_i)\}/p(X_i)^2, \]
\[ (J_n)_{22}(a,b) = \frac{1}{n}\sum_{i=1}^{n} \{-aN''(X_i-b)/p(X_i) + (aN'(X_i-b))^2/p(X_i)^2\}, \]

and $(J_n)_{21}(a,b) = (J_n)_{12}(a,b)$. The variance of $\mathrm{tr}(I_n(\hat{w})J_n^{-1}(\hat{w}))$ is larger
than those of $nV$ and $n(C_n - T_n)$. The dimension of the parameter space
is two, and the true parameter is a regular point for the log loss function.
However, the regular theory needs $n > 500$. From the numerical point of
view, the precise approximation of the free energy does not ensure that of
the generalization loss.


3. A neural network is defined by a conditional density

\[ p(y|x,a,b) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}(y - a\tanh(bx))^2\Big). \]

A probability density of $x$ is set as the uniform distribution on $[-2,2]$. A
true conditional density is set as $p(y|x,1,1)$, and the prior for $(a,b)$
is the uniform distribution on $[0,2] \times [0,2]$. Then the true distribution is realizable by and
regular for the statistical model, and the maximum a posteriori estimator is
equal to the maximum likelihood estimator. The value $\mathrm{tr}(I_n(\hat{w})J_n^{-1}(\hat{w}))$
is numerically compared with $nV$, where $V$ is the functional variance. The
experimental result is shown in Figure 4.5. The horizontal axis shows the
sample size, and the vertical axis shows the average and standard deviation
over 100 sample sets $(X^n, Y^n)$. In this case the true parameter is a regular
point for the statistical model.

[Figure 4.5 plot. Curves: tr(IJ^{-1}), n x V, n x (CV-TE). Horizontal axis: n, sample size (0 to 1200).]

Figure 4.5: Comparison of $\mathrm{tr}(IJ^{-1})$ with $V$ in neural network. Values of
$\mathrm{tr}(I_n(\hat{w})J_n^{-1}(\hat{w}))$ are compared with $nV$ and $n(CV - TE)$, where $V$ is the
functional variance, $CV$ is the cross validation error, and $TE$ is the training
error in a neural network.

Let

\[ Z_0(x) = \tanh(bx), \]
\[ Z_1(x) = (1 - Z_0(x)^2)x, \]
\[ Z_2(x) = -2Z_0(x)Z_1(x)x. \]

Show that

\[ (I_n)_{11}(a,b) = \frac{1}{n}\sum_{i=1}^{n} (aZ_0(X_i) - Y_i)^2 Z_0(X_i)^2, \]
\[ (I_n)_{12}(a,b) = \frac{1}{n}\sum_{i=1}^{n} a(aZ_0(X_i) - Y_i)^2 Z_0(X_i)Z_1(X_i), \]
\[ (I_n)_{22}(a,b) = \frac{1}{n}\sum_{i=1}^{n} a^2(aZ_0(X_i) - Y_i)^2 Z_1(X_i)^2, \]

and $(I_n)_{21}(a,b) = (I_n)_{12}(a,b)$. Also show that the matrix $J_n(a,b)$ is given
by

\[ (J_n)_{11}(a,b) = \frac{1}{n}\sum_{i=1}^{n} Z_0(X_i)^2, \]
\[ (J_n)_{12}(a,b) = \frac{1}{n}\sum_{i=1}^{n} \{2aZ_0(X_i)Z_1(X_i) - Y_iZ_1(X_i)\}, \]
\[ (J_n)_{22}(a,b) = \frac{1}{n}\sum_{i=1}^{n} \{a^2Z_1(X_i)^2 + a(aZ_0(X_i) - Y_i)Z_2(X_i)\}, \]

and $(J_n)_{21}(a,b) = (J_n)_{12}(a,b)$. The variance of $\mathrm{tr}(I_n(\hat{w})J_n^{-1}(\hat{w}))$ is larger
than those of $nV$ and $n(C_n - T_n)$. The dimension of the parameter space
is two, and the true parameter is a regular point for the log loss function;
however, the regular theory needs $n > 1000$. In practical applications, neural
networks which have many parameters are employed, hence the regular
theory does not always hold. However, the general theory introduced in
the following chapters holds.




Chapter 5 


Standard Posterior 
Distribution 


If a true distribution is regular for a statistical model and if the posterior
distribution can be approximated by a normal distribution, the difference
between Bayesian and maximum likelihood estimations is not so large. However,
the posterior distributions are often far from any normal distribution,
in which case Bayesian estimation gives more accurate inference than
other estimation methods. In this chapter we study the case when the
posterior density $p(w)$ is asymptotically given by

\[ p(w) \propto \exp(-n\, w_1^{2k_1} w_2^{2k_2} \cdots w_d^{2k_d}). \]

It might seem that such a posterior density appears only in a special case. However,
in the next chapter, we show that most posterior densities are mathematically
equivalent to this function. Therefore, the results of this chapter
are the universal laws of Bayesian statistics.
This chapter consists of the following parts.
(1) A standard form is introduced and the real log canonical threshold is
defined.
(2) The asymptotic property of a state density function is derived.
(3) The asymptotic behavior of the free energy or the minus log marginal
likelihood is represented by the real log canonical threshold.
(4) By using the renormalized posterior distribution, mathematical laws
among the generalization loss, cross validation loss, training loss, and WAIC
are established.
(5) If random variables are conditionally independent, the relation between
the generalization and cross validation losses does not hold. However, the
mathematical theorem which connects the generalization loss and WAIC can
be proved.


5.1 Standard Form 
In order to define a standard form, we need a representation of a multi-index. 
Definition 13. (Multi-index) A $d$-dimensional multi-index $k$ is defined by

\[ k = (k_1, k_2, \ldots, k_d). \]

For a multi-index $k$, notations $k \ge 0$ and $k > 0$ are defined as follows.

\[ k \ge 0 \iff k_1 \ge 0,\ k_2 \ge 0,\ \ldots,\ k_d \ge 0, \]

and

\[ k > 0 \iff k \ge 0 \text{ and there exists } j \text{ such that } k_j > 0. \]

Let $k \ge 0$. For a given variable $w = (w_1, w_2, \ldots, w_d) \in \mathbb{R}^d$ we define

\[ w^k = w_1^{k_1} w_2^{k_2} \cdots w_d^{k_d}, \]

where $0^0 = 1$.
Example 23. Let d= 5. For multi-indexes, 
k = (3, 6, fe 0, 0), 
$h = (1, 0, 0, 8, 0)$,
and a variable $w = (w_1, w_2, w_3, w_4, w_5)$,
uw = utuful, 
$w^h = w_1 w_4^8$.

Note that, if $h = (0,0,0,0,0)$, then

\[ w^h = 1. \]
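The multi-index power of Definition 13 can be evaluated with a few lines of Python. This is only an illustrative sketch; the function name `monomial` and the numerical values of $w$ are hypothetical.

```python
def monomial(w, k):
    """Return w^k = w_1^{k_1} * ... * w_d^{k_d} for a multi-index k >= 0."""
    if len(w) != len(k):
        raise ValueError("w and k must have the same dimension")
    value = 1
    for w_j, k_j in zip(w, k):
        # Python evaluates 0 ** 0 as 1, which matches the convention 0^0 = 1.
        value *= w_j ** k_j
    return value


if __name__ == "__main__":
    w = (2, 3, 1, 2, 5)                      # hypothetical variable
    print(monomial(w, (1, 0, 0, 8, 0)))      # w_1 * w_4^8
    print(monomial(w, (0, 0, 0, 0, 0)))      # the zero multi-index gives 1
```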


The following definition introduces the concept of the standard form. Let $q(x)$
and $p(x|w)$ be a true distribution and a statistical model, respectively.

Definition 14. (Standard form) Let $x \in \mathbb{R}^N$ and $w \in W \subset \mathbb{R}^d$. We study
a statistical model $p(x|w)$ of $x$ for a given parameter $w$. Assume that $W$ is
a compact set which is the closure of some open set. Also we assume that $W$
contains the origin and that $W_0$ is the set of parameters which minimize
the average log loss function. The log density ratio function and its average
are respectively denoted by

\[ f(x,w) = \log\frac{p(x|w_0)}{p(x|w)}, \tag{5.1} \]
\[ K(w) = \int q(x) f(x,w)\,dx. \tag{5.2} \]

We assume that the log density ratio function has relatively finite variance,
hence $f(x,w)$ does not depend on the choice of $w_0 \in W_0$. A pair of a statistical
model $p(x|w)$ and a prior $\varphi(w)$ is said to be a standard form if there exist
functions $a(x,w)$ and $b(w)$ which satisfy

\[ f(x,w) = w^k a(x,w), \tag{5.3} \]
\[ K(w) = w^{2k}, \tag{5.4} \]
\[ \varphi(w) = |w^h|\, b(w), \tag{5.5} \]

where both $k > 0$ and $h \ge 0$ are multi-indexes and $b(w) > 0$ in a neighborhood
of the origin.


Remark 28. (Normal crossing function) If an average log density ratio function
$K(w)$ is represented by $w^{2k}$, then it is called normal crossing. It might
seem that only a very special pair of a statistical model and a prior has a standard
form. However, in the next section, we show that almost all statistical models
and priors, such as a normal mixture and a neural network, can be made to
be standard forms by using an algebraic geometrical transform of the
parameter set. Therefore the statistical theory of this chapter holds for such
statistical models and priors.


By the definition and

\[ K(w) = \int q(x) f(x,w)\,dx, \]

it follows that

\[ w^k = \int q(x) a(x,w)\,dx. \]

The set of optimal parameters $W_0 = \{w \in W; K(w) = 0\}$ is

\[ W_0 = \{w \in W; w^{2k} = 0\}, \]

which is equal to

\[ W_0 = \bigcup_{j; k_j > 0} \{w \in W; w_j = 0\}, \]

where $\bigcup_{j; k_j > 0}$ shows the union of sets for all $j$ such that $k_j > 0$.


Example 24. Let $x, w \in \mathbb{R}^1$ and $W = [-1,1]$. A statistical model and a prior
are

\[ p(x|w) = \frac{1}{\sqrt{\pi}}\exp(-(x-w)^2), \]
\[ \varphi(w) = |w|^2. \]

Assume that $q(x) = p(x|0)$. Then $w_0 = 0$ and

\[ f(x,w) = w(w - 2x), \tag{5.6} \]
\[ K(w) = w^2. \tag{5.7} \]

Therefore the pair of a model and a prior is a standard form with $k = 1$, $h = 2$
and

\[ a(x,w) = w - 2x, \tag{5.8} \]
\[ b(w) = 1. \tag{5.9} \]

Example 25. Let $-1 \le x \le 1$, $y \in \mathbb{R}$, and $W = \{(s,t) \in \mathbb{R}^2; -1 \le s,t \le 1\}$.
A statistical model and a prior are defined by


Ne) 
YS» SE#m 


\[ p(x,y|s,t) = \frac{1}{2\sqrt{2\pi}}\exp\Big(-\frac{1}{2}(y - sF(x,t))^2\Big), \]
\[ \varphi(s,t) = 1/4, \]
where 

t 
F(w,t) =< Vo e224 
+1 


Therefore $p(x|s,t)$ is the uniform distribution on $[-1,1]$. Assume that
$q(x,y) = p(x,y|0,0)$. Then

\[ f(x,s,t) = s^2F(x,t)^2/2 - syF(x,t), \tag{5.10} \]
\[ K(s,t) = s^2. \tag{5.11} \]

Hence

\[ a(x,s,t) = sF(x,t)^2/2 - yF(x,t), \tag{5.12} \]
\[ b(s,t) = 1/4. \tag{5.13} \]

Therefore the pair of a model and a prior is a standard form with $k = (1,0)$,
$h = (0,0)$.




Example 26. Let $x = (y,z)$, $w = (s,t) \in \mathbb{R}^2$ and

\[ W = \{w = (s,t); s^2 + t^2 \le 1\}. \]

A statistical model and a prior are

\[ p(y,z|s,t) = \frac{1}{\pi}\exp(-(y-s)^2 - (z-t)^2), \]
\[ \varphi(s,t) = \frac{1}{\pi}. \]

Assume that $q(x) = p(x|0,0)$. Then $w_0 = (0,0)$ and

\[ f(x,w) = s^2 + t^2 - 2ys - 2zt, \]
\[ K(w) = s^2 + t^2. \]

This is not a standard form. By using a polar coordinate $(r,\theta)$,

\[ s = r\cos\theta, \]
\[ t = r\sin\theta, \]

where $0 \le r \le 1$ and $0 \le \theta < 2\pi$, the model and prior are rewritten as

\[ f(x,r,\theta) = r(r - 2y\cos\theta - 2z\sin\theta), \]
\[ K(r,\theta) = r^2, \]
\[ \varphi(w)\,dw = \frac{r}{\pi}\,dr\,d\theta. \]

Therefore the pair of a model and a prior is a standard form with $k =
(1,0)$, $h = (1,0)$ and

\[ a(x,r,\theta) = r - 2y\cos\theta - 2z\sin\theta, \]
\[ b(r,\theta) = \frac{1}{\pi}. \]

In this case, a true distribution $q(x)$ is regular for a statistical model $p(x|s,t)$,
but not for $p(x|r,\theta)$.


Lemma 21. Assume that a statistical model $p(x|w)$ has a relatively finite
variance and that there exists a $C^1$-class function $K_0(w) > 0$ which satisfies

\[ f(x,w) = w^k a(x,w), \tag{5.14} \]
\[ K(w) = w^{2k} K_0(w), \tag{5.15} \]
\[ \varphi(w) = |w^h|\, b(w), \tag{5.16} \]

where $k > 0$ and $h \ge 0$ are multi-indexes and $b(w) > 0$ in a neighborhood of
the origin. Since $k > 0$, we can assume $k_1 > 0$ without loss of generality.
Also we assume that

\[ W(\partial K_0) = \Big\{ w \in W;\ K_0(w)^{1/(2k_1)} + \frac{w_1}{2k_1} K_0(w)^{1/(2k_1)-1} \frac{\partial K_0(w)}{\partial w_1} = 0 \Big\} \]

is a measure zero subset in $W$. Then by using

\[ u_1 = K_0(w)^{1/(2k_1)} w_1, \]
\[ u_2 = w_2, \]
\[ \vdots \]
\[ u_d = w_d, \]

the pair of a model and a prior is a standard form of $u$.


Proof. The absolute value of the determinant of the Jacobian matrix $|\partial u/\partial w|$
is equal to $|\partial u_1/\partial w_1|$ and

\[ \frac{\partial u_1}{\partial w_1} = K_0(w)^{1/(2k_1)} + \frac{w_1}{2k_1} K_0(w)^{1/(2k_1)-1} \frac{\partial K_0(w)}{\partial w_1}. \]

Since $K_0(w) > 0$ is a $C^1$-class function and $W$ is compact, there exist
constants $A, B > 0$ such that

\[ \min_{w \in W} K_0(w) > A, \]
\[ \max_{w \in W} \Big|\frac{\partial K_0(w)}{\partial w_1}\Big| < B. \]

Therefore, in a neighborhood of the origin, $|\partial u_1/\partial w_1| > 0$. The map $w \mapsto u$
is one-to-one in the set $W \setminus W(\partial u_1/\partial w_1)$, where $W(\partial u_1/\partial w_1)$ is the set
of all zero points of $|\partial u_1/\partial w_1|$. Thus its inverse function is well-defined in
$W \setminus W(\partial u_1/\partial w_1)$, which is denoted by $w = g(u)$. Then

\[ f(x,g(u)) = u^k a(x,g(u))/K_0(g(u))^{1/2}, \]
\[ \int q(x) f(x,g(u))\,dx = u^{2k}, \]
\[ \varphi(g(u))|g'(u)| = |u^h|\, b(g(u))\, |g'(u)|, \]

which shows the pair of a model and a prior is a standard form, where $|g'(u)|$
is the absolute value of the determinant of the Jacobian matrix of $w = g(u)$. □
Example 27. Let $0 \le x \le 1$, $y \in \mathbb{R}$, and $W = \{(s,t) \in \mathbb{R}^2; 0 \le s,t \le 1\}$. A
statistical model and a prior are defined by

\[ p(x,y|s,t) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}(y - s\tanh(tx))^2\Big), \]
\[ \varphi(s,t) = 1. \]

Therefore $p(x|s,t)$ is the uniform distribution on $[0,1]$. Assume that $q(x,y) =
p(x,y|0,0)$. Then

\[ K(s,t) = \frac{s^2t^2}{2} K_0(t), \]

where

\[ K_0(t) = \int_0^1 \Big(\frac{\tanh(tx)}{t}\Big)^2 dx. \]

By the above lemma, the pair of a model and a prior is made to be a standard
form by

\[ s' = s, \]
\[ t' = t\,(K_0(t)/2)^{1/2}. \]

In this case, a true distribution cannot be regular for a statistical model by
any transform of parameters.

Example 28. Let $N(x)$ be a probability density function of the normal
distribution whose average and standard deviation are zero and one respectively,

\[ N(x) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{x^2}{2}\Big). \]

Let $x \in \mathbb{R}$ and $W = \{(s,t); 0 \le s \le 1, |t| \le 1\}$. A statistical model and a
prior are defined by

\[ p(x|s,t) = (1-s)N(x) + sN(x-t), \]
\[ \varphi(s,t) = 1/2. \]

Assume that a true distribution is $N(x)$. Then the set of optimal parameters
is

\[ W_0 = \{(s,t) \in W; s = 0 \text{ or } t = 0\}. \]

The log density ratio function is

\[ f(x,s,t) = \log\frac{N(x)}{(1-s)N(x) + sN(x-t)} \]
\[ = -\log[1 + s\{e^{tx - t^2/2} - 1\}] \]
\[ = -st\,(x - t/2)\, T(tx - t^2/2)\, S(s(e^{tx - t^2/2} - 1)), \]

where two analytic functions are defined by

\[ S(x) = \log(1+x)/x, \]
\[ T(x) = (e^x - 1)/x, \]

with $T(0) = S(0) = 1$. Hence by

\[ a(x,s,t) = -(x - t/2)\, T(tx - t^2/2)\, S(s(e^{tx - t^2/2} - 1)), \]

the pair of a model and a prior is a standard form. Note that

\[ K(s,t) = \int N(x)\{f(x,s,t) + \exp(-f(x,s,t)) - 1\}\,dx \]
\[ = \int N(x) f(x,s,t)^2\, U(f(x,s,t))\,dx \]
\[ = s^2t^2 \int N(x) a(x,s,t)^2\, U(f(x,s,t))\,dx, \]

where

\[ U(x) = (x + e^{-x} - 1)/x^2 \]

is an analytic function by defining $U(0) = 1/2$. In this case, a true
distribution $N(x)$ cannot be regular for a statistical model by any transform of
parameters.


Example 29. Let $0 \le x_1, x_2 \le 1$, $y \in \mathbb{R}$, and $W = \{(s,t_1,t_2) \in \mathbb{R}^3; 0 \le s \le
1, t_1^2 + t_2^2 \le 1\}$. A statistical model and a prior are defined by

\[ p(x_1,x_2,y|s,t) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}(y - s\tanh(t_1x_1 + t_2x_2))^2\Big), \]
\[ \varphi(s,t_1,t_2) = 1/\pi. \]

Therefore $p(x_1,x_2|s,t_1,t_2)$ is the uniform distribution on $[0,1]^2$. Assume
that $q(x_1,x_2,y) = p(x_1,x_2,y|0,0,0)$. By using

\[ s = s, \]
\[ t_1 = r\cos\theta, \]
\[ t_2 = r\sin\theta, \]

where $0 \le r \le 1$, $-\pi \le \theta < \pi$,

\[ K(s,r,\theta) = \frac{s^2r^2}{2} K_0(r,\theta), \]
\[ (1/\pi)\,ds\,dt_1\,dt_2 = (r/\pi)\,ds\,dr\,d\theta, \]

where

\[ K_0(r,\theta) = \int_0^1\!\!\int_0^1 \Big(\frac{\tanh\{r(x_1\cos\theta + x_2\sin\theta)\}}{r}\Big)^2 dx_1\,dx_2. \]

The pair of a model and a prior is made a standard form.


Example 30. A neural network which has $H$ hidden units is defined by

\[ p(x,y|\{s_h,t_h\}) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}\Big(y - \sum_{h=1}^{H} s_h\tanh(t_hx)\Big)^2\Big). \]

In this case, in order to find a transform which makes the model a standard
form, we need the method explained in the next chapter.
Remark 29. Almost all statistical models and priors used in practical
applications can be made to be standard forms by choosing an appropriate
function,

\[ w = g(u). \]

Then a statistical model and a prior are rewritten as

\[ p(x|w) = p(x|g(u)), \]
\[ \varphi(w)\,dw = \varphi(g(u))|g'(u)|\,du, \]

where $|g'(u)|$ is the absolute value of the determinant of the Jacobian matrix,

\[ |g'(u)| = \Big|\det\Big(\frac{\partial w}{\partial u}\Big)\Big|. \]

From the viewpoint of Bayesian statistics, $(p(x|w), \varphi(w))$ is equivalent to
$(p(x|g(u)), \varphi(g(u))|g'(u)|)$. In other words, the free energy and the generalization,
cross validation, and training losses are invariant under $w = g(u)$. A method
to find the appropriate function $w = g(u)$ is discussed in the next chapter.


In order to study statistical estimation, we need additional mathematical 
conditions. 


Definition 15. (Mathematical condition) (1) The set of parameters $W$ is
a compact set in $\mathbb{R}^d$ and the closure of its largest open set is equal to $W$.
(2) A statistical model is a standard form and $a(x,w)$ defined in eq.(5.3)
satisfies that for an arbitrary $s > 0$ and arbitrary multi-index $k \ge 0$,

\[ \int \sup_{w \in W} |(\partial/\partial w)^k a(x,w)|^s\, q(x)\,dx < \infty. \]

To construct statistical theory for a standard form, we need a stochastic
process because $W_0$ is not a single element.


Definition 16. Assume that a statistical model is a standard form. A
stochastic process $\xi_n(w)$ is defined by

\[ \xi_n(w) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \{w^k - a(X_i,w)\}. \tag{5.17} \]

By this definition and the assumption that a statistical model is a standard
form, it follows that

\[ E[\xi_n(w)] = 0, \tag{5.18} \]
\[ E[\xi_n(w)\xi_n(u)] = E_X[a(X,w)a(X,u)] - w^ku^k. \tag{5.19} \]

On the other hand, the Gaussian process $\xi(w)$ on $W$ that satisfies

\[ E_\xi[\xi(w)] = 0, \tag{5.20} \]
\[ E_\xi[\xi(w)\xi(u)] = E_X[a(X,w)a(X,u)] - w^ku^k, \tag{5.21} \]

is uniquely determined. If $\xi_n(w)$ satisfies the conditions of Definition 15,

\[ \lim_{n \to \infty} E[F(\xi_n)] = E_\xi[F(\xi)] \]

holds for an arbitrary continuous and bounded functional $F(\ )$, where $F$ is
a function from a set of functions

\[ \{f(w); \sup_{w \in W} |f(w)| < \infty\} \]

to $\mathbb{R}$. Moreover, for an arbitrary $s > 0$,

\[ \lim_{n \to \infty} E[\sup_{w \in W} |\xi_n(w)|^s] = E_\xi[\sup_{w \in W} |\xi(w)|^s]. \]

The mathematical background of these properties is explained in Section
10.4.


Example 31. Let $\{X_i; i = 1,2,\ldots,n\}$ be a set of independent random
variables which are subject to the uniform distribution on $[-1,1]$. For a
parameter $a$ $(0 \le a \le 2\pi)$, an empirical process $\xi_n(a)$ is defined by

\[ \xi_n(a) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \sin(aX_i). \]

For each $a$, $\xi_n(a)$ converges to a normal distribution whose average is zero
and variance is $E_X[\sin(aX)^2]$. See Figure 5.1. Moreover, $\xi_n(a)$ converges
to a Gaussian process as a random process. Here a random process is a
function-valued random variable.

[Figure 5.1 plot. Title: Empirical Process. Panels: n=10, n=100.]

Figure 5.1: Examples of empirical processes are illustrated for $n = 10$ and
$n = 100$. In the standard theory, the set of the optimal parameters $W_0$
consists of a union of several manifolds, hence empirical process theory is necessary.
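The pointwise convergence in Example 31 can be illustrated by simulation. The sketch below assumes that the empirical process has the form $\xi_n(a) = n^{-1/2}\sum_i \sin(aX_i)$, which is consistent with the stated mean zero and variance $E_X[\sin(aX)^2]$; the function names, the seed, and the numbers of trials are hypothetical choices.

```python
import math
import random


def xi_n(a, xs):
    """Empirical process xi_n(a) = n^{-1/2} * sum_i sin(a * X_i)."""
    return sum(math.sin(a * x) for x in xs) / math.sqrt(len(xs))


def limiting_variance(a):
    """E_X[sin(aX)^2] for X uniform on [-1, 1] (closed form, a != 0)."""
    return 0.5 - math.sin(2.0 * a) / (4.0 * a)


def simulated_variance(a, n, trials, seed=0):
    """Sample variance of xi_n(a) over independent sample sets of size n."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        values.append(xi_n(a, xs))
    mean = sum(values) / trials
    return sum((v - mean) ** 2 for v in values) / (trials - 1)


if __name__ == "__main__":
    a = 1.0
    print(simulated_variance(a, n=100, trials=2000), limiting_variance(a))
```

Evaluating `xi_n` on a grid of values of $a$ with one fixed sample set gives a single sample path of the random process, which is what the two panels of Figure 5.1 display.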


Theorem 8. Assume that a pair of a statistical model and a prior is a standard
form. The log likelihood ratio function is defined by

\[ K_n(w) = \frac{1}{n}\sum_{i=1}^{n} f(X_i,w). \]

Then

\[ nK_n(w) = n\,w^{2k} - \sqrt{n}\,w^k\,\xi_n(w), \tag{5.22} \]

where $\xi_n(w)$ satisfies the convergence in distribution $\xi_n(w) \to \xi(w)$.

Proof. This theorem is shown by applying definitions $f(x,w) = w^ka(x,w)$,
$K(w) = w^{2k}$, and eq.(5.17) to

\[ nK_n(w) = nK(w) - n(K(w) - K_n(w)). \]

Then by using the empirical process theory in Section 10.4, we obtain the
theorem. □
Example 32. For a nonregular case in Example 25, $\xi_n(s,t)$ is given by

\[ \xi_n(s,t) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \{s - sF(X_i,t)^2/2 + Y_iF(X_i,t)\}. \]

Then

\[ nK_n(s,t) = n\,s^2 - \sqrt{n}\,s\,\xi_n(s,t). \]


5.2 State Density Function 


Let $K(w)$ and $\varphi(w)$ be the average log density ratio function and a prior of
$w \in W \subset \mathbb{R}^d$ respectively. The posterior distribution has a form

\[ \exp(-nK(w))\,\varphi(w), \]

which is the Laplace transform of the state density function

\[ \delta(t - K(w))\,\varphi(w), \]

where $t > 0$. The asymptotic behavior for $n \to \infty$ corresponds to $t \to +0$,
hence in this section we study the state density function and its asymptotic
behavior.

Assume that a pair of a statistical model and a prior is a standard form.
Then on the positive region of parameters,

\[ w_1, w_2, \ldots, w_d \ge 0, \tag{5.23} \]

the state density function is

\[ \delta(t - K(w))\varphi(w) = \delta(t - w^{2k})|w^h|\,b(w)\chi(w), \]

where $\chi(w)$ is the characteristic function of the positive region of parameters,

\[ \chi(w) = \begin{cases} 1 & (w_1, w_2, \ldots, w_d \ge 0) \\ 0 & (\text{otherwise}). \end{cases} \tag{5.24} \]

For general cases other than eq.(5.23), $\delta(t - K(w))\varphi(w)$ can be treated by
using this case. See Remark 33.

Let us show that the asymptotic behavior of the state density function
is determined by the real log canonical threshold and its multiplicity, which are
determined by the multi-indexes $k$ and $h$ as in the following definition.
Definition 17. (Real log canonical threshold and its multiplicity) Let $k > 0$
and $h \ge 0$ be $d$-dimensional multi-indexes. Without loss of generality, we
can assume that

\[ \Big\{ \frac{h_j + 1}{2k_j};\ j = 1, 2, \ldots, d \Big\} \]

is a nondecreasing sequence of $j$, where, if $k_j = 0$, we define

\[ \frac{h_j + 1}{2k_j} = +\infty. \]

Then by definition, there exist a positive real value $\lambda > 0$ and a positive
integer $m$ $(1 \le m \le d)$ such that

\[ \lambda = \frac{h_1 + 1}{2k_1} = \frac{h_2 + 1}{2k_2} = \cdots = \frac{h_m + 1}{2k_m} \]

and

\[ \lambda < \frac{h_j + 1}{2k_j} \quad (m < j \le d). \]

Then the constants $\lambda$ and $m$ constitute a real log canonical threshold and
its multiplicity. The redundant multi-index $\mu = (\mu_{m+1}, \ldots, \mu_d) \in \mathbb{R}^{d-m}$ is
defined by

\[ \mu_j = -2\lambda k_j + h_j \quad (j = m+1, \ldots, d). \tag{5.25} \]

By the definition, $\mu_j > -1$. If $m = d$, then $\mu$ is the empty set.
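Example 33. For the case

\[ k = (3, 1, 4, 1, 2, 0, 0), \]
\[ h = (2, 0, 3, 1, 5, 0, 1), \]

elements of the set $\{(h_j + 1)/(2k_j)\}$ are given by

\[ \frac{2+1}{6},\ \frac{0+1}{2},\ \frac{3+1}{8},\ \frac{1+1}{2},\ \frac{5+1}{4},\ \frac{0+1}{0},\ \frac{1+1}{0}, \]

which is a nondecreasing sequence. Therefore $\lambda = 1/2$ and $m = 3$, resulting
that

\[ (\mu_4, \mu_5, \mu_6, \mu_7) = (0, 3, 0, 1). \]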
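The procedure of Definition 17 can be sketched in Python with exact rational arithmetic. The function name `rlct` and its return convention are hypothetical; the sketch assumes integer multi-indexes, sorts the coordinates itself instead of assuming a nondecreasing order, and uses the multi-indexes of Example 33 as the demonstration input.

```python
from fractions import Fraction


def rlct(k, h):
    """Return (lam, m, mu, order) for integer multi-indexes k > 0 and h >= 0.

    lam   : real log canonical threshold, min_j (h_j + 1) / (2 k_j)
    m     : multiplicity, the number of coordinates attaining lam
    mu    : redundant multi-index for the remaining coordinates
    order : coordinate indices sorted so that the ratios are nondecreasing
    """
    d = len(k)

    def ratio(j):
        return Fraction(h[j] + 1, 2 * k[j])

    # Coordinates with k_j > 0 have a finite ratio; k_j = 0 means +infinity.
    finite = sorted((j for j in range(d) if k[j] > 0), key=ratio)
    infinite = [j for j in range(d) if k[j] == 0]
    if not finite:
        raise ValueError("k > 0 requires at least one positive component")
    order = finite + infinite
    lam = ratio(order[0])
    m = sum(1 for j in finite if ratio(j) == lam)
    mu = [-2 * lam * k[j] + h[j] for j in order[m:]]
    return lam, m, mu, order


if __name__ == "__main__":
    lam, m, mu, _ = rlct((3, 1, 4, 1, 2, 0, 0), (2, 0, 3, 1, 5, 0, 1))
    print(lam, m, [str(x) for x in mu])
```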
Definition 18. Let $k > 0$ and $h \ge 0$ be arbitrary multi-indexes. Let $\lambda$ and
$m$ be the real log canonical threshold and its multiplicity, respectively. The
constant $C(k,m)$ is defined by

\[ C(k,m) = 2^m (m-1)! \prod_{j=1}^{m} k_j. \tag{5.26} \]

We use a notation,

\[ w = (w_a, w_b), \]

where

\[ w_a = (w_1, w_2, \ldots, w_m), \]
\[ w_b = (w_{m+1}, w_{m+2}, \ldots, w_d). \]

Then a function $f(w)$ is rewritten as $f(w) = f(w_a, w_b)$. A function
(hyperfunction or distribution) $D(w)$ is defined by

\[ D(w) = \frac{1}{C(k,m)}\,\delta(w_a)\,|w_b^\mu|\,b(w)\chi(w), \tag{5.27} \]

where $\mu$ is the redundant multi-index defined by eq.(5.25) and $\delta(w_a)$ is
defined by

\[ \delta(w_a) = \prod_{j=1}^{m} \delta(w_j). \]

If a function $\psi(w_j)$ is not continuous at $w_j = 0$, then

\[ \int \delta(w_j)\psi(w_j)\,dw_j \]

is not defined. In such a case, we adopt a generalized delta function
in the definition of $D(w)$.
Example 34. For $k$ and $m$ in Example 33, the constant $C(k,m)$ is

\[ C(k,m) = 2^3 (3-1)! \cdot 3 \cdot 1 \cdot 4 = 192. \]

By $m = 3$,

\[ w_a = (w_1, w_2, w_3), \]
\[ w_b = (w_4, w_5, w_6, w_7). \]

Then a function $f(w)$ is rewritten as $f(w) = f(w_a, w_b)$. A function
(hyperfunction or distribution) $D(w)$ is given by

\[ D(w) = \frac{1}{192}\,\delta(w_1)\delta(w_2)\delta(w_3)\,|(w_4)^0(w_5)^3(w_6)^0(w_7)^1|\,b(w)\chi(w). \]

Since $\mu_j > -1$, $\int_W D(w)\,dw$ is a finite value if $W$ is compact.
Theorem 9. Assume that a pair of a statistical model and a prior is a
standard form and that $\lambda$ and $m$ are the real log canonical threshold and its
multiplicity. Then the following asymptotic expansion holds for $t \to +0$,

\[ \delta(t - w^{2k})|w^h|\,b(w)\chi(w) = t^{\lambda-1}(-\log t)^{m-1} D(w) + o(t^{\lambda-1}(-\log t)^{m-1}), \tag{5.28} \]

where $\chi(w)$ and $D(w)$ are defined in eq.(5.24) and (5.27) respectively.

Proof. For an arbitrary $C^1$-class function $\psi(w)$, we define

\[ v(t) = \int \delta(t - w^{2k})|w^h|\,b(w)\chi(w)\psi(w)\,dw. \]

The Mellin transform of $v(t)$ is defined by

\[ (Mv)(z) = \int_0^\infty v(t)\,t^z\,dt, \]

where $z \in \mathbb{C}$. Then by the definition,

\[ (Mv)(z) = \int w^{2kz}|w^h|\,b(w)\chi(w)\psi(w)\,dw. \]

Since $W$ is compact, there exists $D > 0$ such that

\[ (Mv)(z) = \int_{[0,D]^d} w^{2kz}|w^h|\,b(w)\psi(w)\,dw. \]

By the mean value theorem, there exists $w^*$ such that $|w^*| \le |w|$ and

\[ b(w)\psi(w) = b(0,w_b)\psi(0,w_b) + \sum_{j=1}^{m} w_j (\partial/\partial w_j)(b(w^*)\psi(w^*)). \tag{5.29} \]

A complex function of $z \in \mathbb{C}$ defined by

\[ \int_{[0,D]^m} w_a^{2kz}|w_a^h|\,dw_a = \prod_{j=1}^{m} \frac{D^{2k_jz + h_j + 1}}{2k_jz + h_j + 1} = \prod_{j=1}^{m} \frac{D^{2k_j(z+\lambda)}}{2k_j(z+\lambda)} \]

has a pole at $z = -\lambda$ with the order $m$. Therefore by using eq.(5.29),

\[ (Mv)(z) = \frac{1}{(z+\lambda)^m}\Big(\prod_{j=1}^{m}\frac{1}{2k_j}\Big)\int_{[0,D]^{d-m}} w_b^\mu\, b(0,w_b)\psi(0,w_b)\,dw_b + O\Big(\frac{1}{(z+\lambda)^{m-1}}\Big). \]

Note that the integration of the first term in the right hand side is finite
because the redundant index satisfies $\mu_j > -1$. Then by using the inverse
Mellin transform shown in Remark 30,

\[ v(t) = \frac{t^{\lambda-1}(-\log t)^{m-1}}{(m-1)!}\Big(\prod_{j=1}^{m}\frac{1}{2k_j}\Big)\int_{[0,D]^{d-m}} w_b^\mu\, b(0,w_b)\psi(0,w_b)\,dw_b + O(t^{\lambda-1}(-\log t)^{m-2}), \]

which completes Theorem 9. □
Remark 30. The Mellin transform of

\[ f_m(t) = \begin{cases} t^{\lambda-1}(-\log t)^{m-1} & (0 < t < 1) \\ 0 & (\text{otherwise}) \end{cases} \]

is given by

\[ (Mf_m)(z) = \frac{(m-1)!}{(z+\lambda)^m}, \]

which can be derived by the recursive partial integration.
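The closed form in Remark 30 can be checked numerically for real $z$. The sketch below assumes the Mellin transform convention $(Mv)(z) = \int_0^\infty v(t)\,t^z\,dt$ and uses the substitution $t = e^{-u}$; the function names, the truncation point, and the step count are hypothetical choices.

```python
import math


def mellin_numeric(z, lam, m, u_max=60.0, n_steps=60000):
    """Midpoint-rule value of int_0^1 t^(lam-1) (-log t)^(m-1) t^z dt.

    The substitution t = exp(-u) turns the integral into
    int_0^inf u^(m-1) exp(-(z + lam) u) du, truncated here at u_max.
    Only real z with z + lam > 0 is handled.
    """
    rate = z + lam
    du = u_max / n_steps
    total = 0.0
    for i in range(n_steps):
        u = (i + 0.5) * du
        total += u ** (m - 1) * math.exp(-rate * u)
    return total * du


def mellin_closed_form(z, lam, m):
    """Closed form (m - 1)! / (z + lam)^m stated in Remark 30."""
    return math.factorial(m - 1) / (z + lam) ** m


if __name__ == "__main__":
    print(mellin_numeric(1.0, 0.5, 3), mellin_closed_form(1.0, 0.5, 3))
```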


Remark 31. Theorem 9 shows the following fact. The state density function
$\delta(t - w^{2k})|w^h|$ is not a well-defined hyperfunction when $t \to +0$. However,
it is asymptotically given by the well-defined hyperfunction $D(w)$,

\[ \delta(t - w^{2k})|w^h|\,b(w)\chi(w) \cong t^{\lambda-1}(-\log t)^{m-1} D(w), \]

and its asymptotic behavior as a function of $t$ is determined by the real log
canonical threshold $\lambda$ and its multiplicity $m$.


Example 35. Let us study a state density function

\[ \delta(t - x^6)|x|^2\,b(x)\chi(x), \]

where $b(x) > 0$. This case corresponds to $k = 3$ and $h = 2$ in the above
theorem. Hence $\lambda = (h+1)/(2k) = 1/2$ and $m = 1$. The redundant index
is empty, and

\[ C(k,m) = 2^m (m-1)!\,k = 6. \]

By Theorem 9, as $t \to +0$,

\[ \delta(t - x^6)|x^2|\,b(x)\chi(x) = \frac{t^{-1/2}}{6}\,\delta(x)\,b(0) + o(t^{-1/2}). \]

Example 36. Let us study a state density function

\[ \delta(t - x^4y^6z^8)|x\,y^2z^6|\,b(x,y,z)\chi(x,y,z), \]

where $b(x,y,z) > 0$. The multi-indexes are $k = (2,3,4)$ and $h = (1,2,6)$,
hence

\[ \lambda = \min\Big\{\frac{1+1}{4},\ \frac{2+1}{6},\ \frac{6+1}{8}\Big\}, \]

which shows $\lambda = 1/2$ and $m = 2$, resulting that

\[ C(k,m) = 2^m (m-1)!\,k_1 k_2 = 2^2 \cdot (2-1)! \cdot 2 \cdot 3 = 24. \]

The redundant multi-index is $\mu = -2 \cdot (1/2) \cdot 4 + 6 = 2$. By Theorem 9, as
$t \to +0$,

\[ \delta(t - x^4y^6z^8)|x\,y^2z^6|\,b(x,y,z)\chi(x,y,z) \cong \frac{t^{-1/2}(-\log t)}{24}\,\delta(x)\delta(y)\,z^2\,b(0,0,z). \]
Remark 32. Let $K(w)$ and $\varphi(w)$ be an average log density ratio function
and a prior, respectively. The zeta function is defined by

\[ \zeta(z) = \int K(w)^z \varphi(w)\,dw \quad (z \in \mathbb{C}), \]

which is equal to the Mellin transform of the state density function. Then
$\zeta(z)$ is an analytic function on $\mathrm{Re}(z) > 0$, which can be analytically continued
to a meromorphic function whose poles are all real and negative values.
Let the largest pole of $\zeta(z)$ be $(-\lambda)$ and its order be $m$. Then $\lambda$ and $m$ are
equal to the real log canonical threshold and its multiplicity, respectively.


Remark 33. In the above definitions, it is assumed that $W$ is contained in
$[0,D]^d$ for some $D > 0$. For the other cases, we can use the same procedures.
Let $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_d)$, where $\sigma_j = 1$ or $\sigma_j = -1$. The set of all such variables
is denoted by $\Sigma_d$. For $\sigma \in \Sigma_d$, we define

\[ \sigma w = (\sigma_1w_1, \sigma_2w_2, \ldots, \sigma_dw_d). \]

Note that for an arbitrary $\sigma$, $(\sigma w)^{2k} = w^{2k}$ and $|(\sigma w)^h| = |w^h|$. Also we define a
set of parameters $\sigma(S)$ by

\[ \sigma(S) = \{\sigma w; w \in S\}. \]

If $W \subset \sigma([0,D]^d)$, we define a generalized version of $D(w)$ by

\[ D_\sigma(w) = \frac{1}{C(k,m)}\,\delta(\sigma w_a)\,|w_b^\mu|\,b(w)\chi(\sigma w). \tag{5.30} \]

Then

\[ \delta(t - w^{2k})|w^h|\,b(w)\chi(\sigma w) = \delta(t - (\sigma w)^{2k})|(\sigma w)^h|\,b(w)\chi(\sigma w) \]
\[ = t^{\lambda-1}(-\log t)^{m-1} D_\sigma(w) + o(t^{\lambda-1}(-\log t)^{m-1}). \]

Therefore, the asymptotic form of the state density function for the case
$\sigma W \subset [0,D]^d$ results in Theorem 9. If $W \subset [-D,D]^d$, the state density
function is the sum

\[ \delta(t - w^{2k})|w^h|\,b(w) = \sum_{\sigma \in \Sigma_d} \delta(t - w^{2k})|w^h|\,b(w)\chi(\sigma w) \]
\[ = t^{\lambda-1}(-\log t)^{m-1}\sum_{\sigma \in \Sigma_d} D_\sigma(w) + o(t^{\lambda-1}(-\log t)^{m-1}). \]

By applying Theorem 9, the state density function is represented by the sum
over $\sigma \in \Sigma_d$. Note that

\[ w^k\,\delta(t - w^{2k})|w^h|\,b(w)\chi(\sigma w) = (\sigma^k)\,t^{\lambda-1/2}(-\log t)^{m-1} D_\sigma(w) + o(t^{\lambda-1/2}(-\log t)^{m-1}). \]

5.3 Asymptotic Free Energy 


In this section, we study the asymptotic behavior of the free energy or the
minus log marginal likelihood. The normalized posterior function $\Omega(w)$ is
defined by

\[ \Omega(w) = \frac{\varphi(w)\prod_{i=1}^{n} p(X_i|w)}{\prod_{i=1}^{n} p(X_i|w_0)}. \tag{5.31} \]

This function is in proportion to the posterior density as a function of the
parameter $w$. The posterior density is an exponential function of $n$, whereas
the normalized posterior function is in proportion to $(\log n)^{m-1}/n^\lambda$, where $\lambda$
and $m$ are the real log canonical threshold and its multiplicity respectively.
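Theorem 10. Assume that a pair of a statistical model and a prior has a
standard form and that $\lambda$ and $m$ are the real log canonical threshold and its
multiplicity, respectively. If the set of parameters $W$ is contained in $[0,D]^d$,
then

\[ \Omega(w) = \frac{(\log n)^{m-1}}{n^\lambda}\,D(w)\int_0^\infty dt\; t^{\lambda-1}\exp(-t + \sqrt{t}\,\xi_n(w)) + o_p\Big(\frac{(\log n)^{m-1}}{n^\lambda}\Big), \tag{5.32} \]

where $\xi_n(w)$ and $D(w)$ are functions defined by eqs. (5.17) and (5.27),
respectively.

Proof. If $w \in W \subset [0,D]^d$, by using eq.(5.22),

\[ \Omega(w) = \exp(-nK_n(w))\,\varphi(w) \]
\[ = \exp(-n\,w^{2k} + \sqrt{n}\,w^k\xi_n(w))\,|w^h|\,b(w) \]
\[ = \int_0^\infty \delta(t - w^{2k})\,|w^h|\exp(-nt + \sqrt{nt}\,\xi_n(w))\,b(w)\,dt. \]

By replacing $t := t/n$ and by Theorem 9, it follows that

\[ \Omega(w) = \int_0^\infty \frac{dt}{n}\,\delta\Big(\frac{t}{n} - w^{2k}\Big)|w^h|\exp(-t + \sqrt{t}\,\xi_n(w))\,b(w) \]
\[ = \int_0^\infty \frac{dt}{n}\Big(\frac{t}{n}\Big)^{\lambda-1}\Big(\log\frac{n}{t}\Big)^{m-1} D(w)\exp(-t + \sqrt{t}\,\xi_n(w)) + o_p((\log n)^{m-1}/n^\lambda). \]

Since

\[ \Big(\log\frac{n}{t}\Big)^{m-1} = (\log n)^{m-1} + O((\log n)^{m-2}), \]

we obtain Theorem 10. □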


Theorem 11. Assume the same condition as in Theorem 10. Then the free
energy or the minus log marginal likelihood

\[ F_n = -\log\int \varphi(w)\prod_{i=1}^{n} p(X_i|w)\,dw \]

has the asymptotic expansion

\[ F_n = nL_n(w_0) + \lambda\log n - (m-1)\log\log n - \log\Big(\int dw\,D(w)\int_0^\infty dt\; t^{\lambda-1}\exp(-t + \sqrt{t}\,\xi_n(w))\Big) + o_p(1), \]

where

\[ L_n(w_0) = -\frac{1}{n}\sum_{i=1}^{n}\log p(X_i|w_0). \]

Proof. By the definition of $\Omega(w)$ in eq.(5.31),

\[ F_n = -\log\prod_{i=1}^{n} p(X_i|w_0) - \log\Big(\int \Omega(w)\,dw\Big). \]

Combining this equation and Theorem 10 completes the theorem. □


Remark 34. If $W$ is contained in $[-D,D]^d$, then by Remark 33,

\[ \delta(t - w^{2k})|w^h|\,b(w) = t^{\lambda-1}(-\log t)^{m-1}\sum_{\sigma \in \Sigma_d} D_\sigma(w) + o(t^{\lambda-1}(-\log t)^{m-1}). \]

Therefore,

\[ F_n = nL_n(w_0) + \lambda\log n - (m-1)\log\log n - \log\Big(\sum_{\sigma \in \Sigma_d}\int dw\,D_\sigma(w)\int_0^\infty dt\; t^{\lambda-1}\exp(-t + \sqrt{t}\,\sigma^k\xi_n(w))\Big) + o_p(1). \]

In other words, the main order terms are the same as in Theorem 11, whereas
the constant order term is different.


Example 37. For a triple of a statistical model, a true distribution, and
a prior given in Example 27, $\lambda = 1/2$ and $m = 2$. A numerical result of
$F_n - nL_n(w_0)$ is shown in Figure 5.4. In general, $\lambda$ and $m$ depend on a
true distribution, hence Theorem 11 cannot be employed if we do not
know the true distribution. However, by using the mathematical property
of $F_n$, we can derive the information criterion WBIC by which $F_n$ can be
estimated without information of the true distribution (Chapter 8).


5.4 Renormalized Posterior Distribution 


In this section, we derive the asymptotic behaviors of the generalization,
training, and cross validation losses and WAIC. By using $\Omega(w)$ in eq.(5.31),
the posterior expected value of an arbitrary function $y(w)$ is rewritten as

\[ E_w[y(w)] = \frac{\int y(w)\,\Omega(w)\,dw}{\int \Omega(w)\,dw}. \tag{5.33} \]

This is the definition of the posterior average. However, the behavior for $n \to
\infty$ cannot be derived directly from this definition. By using the standard
form, the posterior distribution can be represented by a product of a function
of $n$ and the fluctuation function. The renormalized posterior distribution
is necessary for the fluctuation part.


Definition 19. (Renormalized posterior distribution) Assume the same
condition as Theorem 10. For an arbitrary function $z(t,w)$, the expectation
with respect to the renormalized posterior distribution is defined by

\[ \langle z(t,w)\rangle = \frac{\int dw\,D(w)\int_0^\infty dt\; z(t,w)\,t^{\lambda-1}\exp(-t + \sqrt{t}\,\xi_n(w))}{\int dw\,D(w)\int_0^\infty dt\; t^{\lambda-1}\exp(-t + \sqrt{t}\,\xi_n(w))}. \tag{5.34} \]

For a given function $f(w)$, the expectation operator $\langle f(w)\rangle$ depends on
$\xi_n(w)$. If the function $\xi_n(w)$ must be explicitly represented, a notation

\[ \langle f(w)\rangle = \langle f(w)\rangle_{\xi_n} \tag{5.35} \]

is used. Note that $\langle f(w)\rangle$ is a random variable.


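Remark 35. By using notation $w = (w_a, w_b)$ in Definition 18 and

\[ D(w) = D(w_a, w_b) \propto \delta(w_a)\,|w_b^\mu|\,b(w), \]

the renormalized posterior average can be rewritten as

\[ \langle z(t,w)\rangle = \frac{\int dw_b\,|w_b^\mu|\,b(0,w_b)\int_0^\infty dt\; z(t,0,w_b)\,t^{\lambda-1}e^{-t + \sqrt{t}\,\xi_n(0,w_b)}}{\int dw_b\,|w_b^\mu|\,b(0,w_b)\int_0^\infty dt\; t^{\lambda-1}e^{-t + \sqrt{t}\,\xi_n(0,w_b)}}. \tag{5.36} \]

Hence, the renormalized posterior distribution does not depend on the set
of values $\{z(t,w_a,w_b); |w_a| > 0\}$. The set $\{(0,w_b)\}$ is contained in the set
of the optimal parameters $W_0$. In other words, the renormalized posterior
distribution is the probability distribution on $W_0$.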
Remark 36. The renormalized posterior distribution is defined for the case
when $W \subset [0,D]^d$ for some $D > 0$. For general cases in Remark 33 it is
defined by replacing

\[ \int dw\,D(w) \mapsto \sum_{\sigma \in \Sigma_d}\int dw\,D_\sigma(w). \]

The same theorems and lemmas hereafter hold.


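Theorem 12. The renormalized posterior distribution satisfies

\[ \langle t\rangle = \lambda + \frac{1}{2}\langle\sqrt{t}\,\xi_n(w)\rangle. \tag{5.37} \]

Proof. By the definition,

\[ \langle t\rangle = \frac{\int dw\,D(w)\int_0^\infty dt\; t^{\lambda}\,e^{-t + \sqrt{t}\,\xi_n(w)}}{\int dw\,D(w)\int_0^\infty dt\; t^{\lambda-1}e^{-t + \sqrt{t}\,\xi_n(w)}}. \tag{5.38} \]

By applying the partial integration over $t$ and using $\lambda > 0$,

\[ \int_0^\infty e^{-t}\,t^{\lambda}e^{\sqrt{t}\,\xi_n(w)}\,dt = \Big[-e^{-t}\,t^{\lambda}e^{\sqrt{t}\,\xi_n(w)}\Big]_0^\infty + \int_0^\infty e^{-t}\big(t^{\lambda}e^{\sqrt{t}\,\xi_n(w)}\big)'\,dt \]
\[ = \lambda\int_0^\infty e^{-t}\,t^{\lambda-1}e^{\sqrt{t}\,\xi_n(w)}\,dt + \int_0^\infty e^{-t}\,t^{\lambda}e^{\sqrt{t}\,\xi_n(w)}\,\xi_n(w)/(2\sqrt{t})\,dt. \]

By applying this equation to eq.(5.38), we obtain Theorem 12. □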
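The partial integration used in the proof of Theorem 12 can be checked numerically. The sketch below is a simplification made for illustration: it fixes a single scalar value `xi` in place of $\xi_n(w)$ and drops the integration over $w$, so that only the $t$-integrals remain. The function names, the truncation point, and the step count are hypothetical.

```python
import math


def moment_integral(p, xi, s_max=12.0, n_steps=24000):
    """Midpoint-rule value of I(p) = int_0^inf t^p exp(-t + sqrt(t) * xi) dt.

    The substitution t = s^2 gives
    I(p) = 2 * int_0^inf s^(2p + 1) exp(-s^2 + s * xi) ds,
    truncated at s_max. Intended for 2p + 1 >= 0 so that the integrand
    stays bounded near s = 0.
    """
    ds = s_max / n_steps
    total = 0.0
    for i in range(n_steps):
        s = (i + 0.5) * ds
        total += s ** (2.0 * p + 1.0) * math.exp(-s * s + s * xi)
    return 2.0 * total * ds


def check_theorem_12(lam, xi):
    """Return (<t>, lam + <sqrt(t) * xi> / 2) for one fixed value xi."""
    norm = moment_integral(lam - 1.0, xi)
    mean_t = moment_integral(lam, xi) / norm
    mean_sqrt_t_xi = xi * moment_integral(lam - 0.5, xi) / norm
    return mean_t, lam + 0.5 * mean_sqrt_t_xi


if __name__ == "__main__":
    print(check_theorem_12(0.5, 0.8))    # the two numbers should agree
    print(check_theorem_12(1.5, -1.2))
```

Because the identity holds for every fixed value of $\xi_n(w)$, it also holds after averaging numerator and denominator over $D(w)$, which is the statement of the theorem.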


There are asymptotic relations between the posterior distribution and
the renormalized one, by which the behaviors of the log density ratio function
and its average

\[ f(x,w) = \log\frac{p(x|w_0)}{p(x|w)} = w^k a(x,w), \tag{5.39} \]
\[ K(w) = \int q(x) f(x,w)\,dx = w^{2k}, \tag{5.40} \]

are clarified.
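Theorem 13. (Scaling law) Assume the same condition as used in
Theorem 10. For an arbitrary positive integer $s$,

\[ E_w[f(x,w)^s] = \frac{1}{n^{s/2}}\langle(\sqrt{t}\,a(x,w))^s\rangle + o_p\Big(\frac{1}{n^{s/2}}\Big), \tag{5.41} \]
\[ E_w[K(w)^s] = \frac{1}{n^s}\langle t^s\rangle + o_p\Big(\frac{1}{n^s}\Big). \tag{5.42} \]

Proof. Let us prove the first half. The latter half can be proved in the same
way. By eq.(5.33),

\[ E_w[f(x,w)^s] = \frac{\int f(x,w)^s\,\Omega(w)\,dw}{\int \Omega(w)\,dw}. \]

Let $N(s)$ be the numerator of this equation. Then $N(0)$ is equal to the
denominator and

\[ N(s) = \int dw\; w^{sk}a(x,w)^s\exp(-n\,w^{2k} + \sqrt{n}\,w^k\xi_n(w))\,w^h\,b(w) \]
\[ = \int_0^\infty dt\int dw\;\Big(\frac{t}{n}\Big)^{s/2}a(x,w)^s\,b(w)\,\delta(t - n\,w^{2k})\,w^h\exp(-t + \sqrt{t}\,\xi_n(w)) \]
\[ = \frac{(\log n)^{m-1}}{n^{\lambda+s/2}}\int_0^\infty dt\int dw\,D(w)\,a(x,w)^s\,t^{\lambda+s/2-1}\exp(-t + \sqrt{t}\,\xi_n(w)) + o_p\Big(\frac{(\log n)^{m-1}}{n^{\lambda+s/2}}\Big), \]

where we used eq.(5.32) in Theorem 10. Therefore $N(s)/N(0)$ satisfies the
first half. □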


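Remark 37. Let $\ell \ge 2$. Assume that a statistical model and a prior have
the standard form. Then we can prove

\[
\sup_{|\alpha|\le 1}\Bigl|\Bigl(\frac{d}{d\alpha}\Bigr)^{\ell}\mathcal{G}_n(\alpha)\Bigr| = O_p\Bigl(\frac{1}{n^{\ell/2}}\Bigr),
\qquad
\sup_{|\alpha|\le 1}\Bigl|\Bigl(\frac{d}{d\alpha}\Bigr)^{\ell}\mathcal{T}_n(\alpha)\Bigr| = O_p\Bigl(\frac{1}{n^{\ell/2}}\Bigr),
\]

in the same way as we proved Theorem 5. In fact,

\[
f(x,w) = w^{k}a(x,w).
\]

Hence

\[
|f(x,w)| \le |w^{k}|\,|a(x,w)|.
\]

By using Lemma 8,

\[
\begin{aligned}
\Bigl|\Bigl(\frac{d}{d\alpha}\Bigr)^{\ell}\mathcal{G}_n(\alpha)\Bigr|
&\le C_\ell\,E_X\Bigl[\frac{E_w[\,|f(X,w)|^{\ell}\exp(-\alpha f(X,w))]}{E_w[\exp(-\alpha f(X,w))]}\Bigr] \\
&\le C_\ell\,E_X\Bigl[\bigl\{\sup_w|a(X,w)|\bigr\}^{\ell}\,\frac{E_w[\,|w^{k}|^{\ell}\exp(-\alpha f(X,w))]}{E_w[\exp(-\alpha f(X,w))]}\Bigr] \\
&= O_p(1/n^{\ell/2}).
\end{aligned}
\]

In the same way,

\[
\Bigl|\Bigl(\frac{d}{d\alpha}\Bigr)^{\ell}\mathcal{T}_n(\alpha)\Bigr|
\le C_\ell\,\frac{1}{n}\sum_{i=1}^{n}\frac{E_w[\,|f(X_i,w)|^{\ell}\exp(-\alpha f(X_i,w))]}{E_w[\exp(-\alpha f(X_i,w))]}
= O_p(1/n^{\ell/2}).
\]

Therefore, we can apply the basic Theorem 8 also to this case.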


Definition 20. By using the renormalized posterior distribution, the fluctuation
of the renormalized posterior distribution $\mathrm{Fluc}(\xi_n)$ is defined by

\[
\mathrm{Fluc}(\xi_n) = E_X\bigl[\langle t\,a(X,w)^2\rangle - \langle\sqrt{t}\,a(X,w)\rangle^2\bigr].
\]

Here $\mathrm{Fluc}(\xi_n)$ is a functional of $\xi_n$, because the expected value using the
renormalized posterior distribution $\langle\ \rangle$ depends on $\xi_n$.


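Theorem 14. Assume the same condition as in Theorem 10. Then

\[
G_n = L(w_0) + \frac{1}{n}\Bigl(\lambda + \frac{1}{2}\langle\sqrt{t}\,\xi_n(w)\rangle - \frac{1}{2}\mathrm{Fluc}(\xi_n)\Bigr) + o_p\Bigl(\frac{1}{n}\Bigr), \tag{5.43}
\]
\[
T_n = L_n(w_0) + \frac{1}{n}\Bigl(\lambda - \frac{1}{2}\langle\sqrt{t}\,\xi_n(w)\rangle - \frac{1}{2}\mathrm{Fluc}(\xi_n)\Bigr) + o_p\Bigl(\frac{1}{n}\Bigr), \tag{5.44}
\]
\[
C_n = L_n(w_0) + \frac{1}{n}\Bigl(\lambda - \frac{1}{2}\langle\sqrt{t}\,\xi_n(w)\rangle + \frac{1}{2}\mathrm{Fluc}(\xi_n)\Bigr) + o_p\Bigl(\frac{1}{n}\Bigr), \tag{5.45}
\]
\[
W_n = L_n(w_0) + \frac{1}{n}\Bigl(\lambda - \frac{1}{2}\langle\sqrt{t}\,\xi_n(w)\rangle + \frac{1}{2}\mathrm{Fluc}(\xi_n)\Bigr) + o_p\Bigl(\frac{1}{n}\Bigr). \tag{5.46}
\]

Proof. By Theorem 3 and the above lemma, the generalization, cross validation,
and training losses are obtained by calculating $\mathcal{G}_n'(0)$, $\mathcal{G}_n''(0)$, $\mathcal{T}_n'(0)$,
and $\mathcal{T}_n''(0)$, by applying the scaling law. Using eq.(5.37),

\[
\begin{aligned}
\mathcal{G}_n'(0) &= L(w_0) + E_w[K(w)] \\
&= L(w_0) + \frac{1}{n}\langle t\rangle + o_p(1/n) \\
&= L(w_0) + \frac{1}{n}\Bigl(\lambda + \frac{1}{2}\langle\sqrt{t}\,\xi_n(w)\rangle\Bigr) + o_p(1/n),
\end{aligned}
\]

and

\[
\begin{aligned}
\mathcal{G}_n''(0) &= E_X\bigl[E_w[f(X,w)^2] - E_w[f(X,w)]^2\bigr] \\
&= \frac{1}{n}E_X\bigl[\langle t\,a(X,w)^2\rangle - \langle\sqrt{t}\,a(X,w)\rangle^2\bigr] + o_p(1/n) \\
&= \frac{1}{n}\mathrm{Fluc}(\xi_n) + o_p(1/n).
\end{aligned}
\]

Hence we obtained eq.(5.43). Similarly,

\[
\begin{aligned}
\mathcal{T}_n'(0) &= L_n(w_0) + E_w[K_n(w)] \\
&= L_n(w_0) + \frac{1}{n}\langle t - \sqrt{t}\,\xi_n(w)\rangle + o_p(1/n) \\
&= L_n(w_0) + \frac{1}{n}\Bigl(\lambda - \frac{1}{2}\langle\sqrt{t}\,\xi_n(w)\rangle\Bigr) + o_p(1/n),
\end{aligned}
\]

and

\[
\begin{aligned}
\mathcal{T}_n''(0) &= \frac{1}{n}\sum_{i=1}^{n}\bigl\{E_w[f(X_i,w)^2] - E_w[f(X_i,w)]^2\bigr\} \\
&= \frac{1}{n^2}\sum_{i=1}^{n}\bigl\{\langle t\,a(X_i,w)^2\rangle - \langle\sqrt{t}\,a(X_i,w)\rangle^2\bigr\} + o_p(1/n).
\end{aligned}
\]

By using the law of large numbers for a functional case,

\[
\sup_{w,v}\Bigl|\frac{1}{n}\sum_{i=1}^{n}a(X_i,w)\,a(X_i,v) - E_X[a(X,w)\,a(X,v)]\Bigr| = o_p(1),
\]

the difference between $n\mathcal{G}_n''(0)$ and $n\mathcal{T}_n''(0)$ goes to zero in probability when
$n \to \infty$. Thus eq.(5.44) is obtained. □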


Remark 38. The standard deviations of $T_n$, $C_n$, and $W_n$ are $O_p(1/\sqrt{n})$,
because the standard deviation of $L_n(w_0)$ is $O_p(1/\sqrt{n})$. The averages and
standard deviations of the four random variables $G_n - L(w_0)$, $T_n - L_n(w_0)$,
$C_n - L_n(w_0)$, and $W_n - L_n(w_0)$ are $O_p(1/n)$. They are asymptotically given
by linear combinations of the two random variables $\langle\sqrt{t}\,\xi_n(w)\rangle$ and $\mathrm{Fluc}(\xi_n)$.
It should be emphasized that Theorem 14 holds even if the true distribution
is not realizable by and singular for a statistical model.


Remark 39. By Theorem 14, the following convergences in probability hold,

\[
n(G_n - L(w_0)) + n(C_n - L_n(w_0)) \to 2\lambda,
\]
\[
n(G_n - L(w_0)) + n(W_n - L_n(w_0)) \to 2\lambda,
\]

where $\lambda$ is the real log canonical threshold. That is to say, for a given
triple $(q(x), p(x|w), \varphi(w))$, if $(C_n - L_n(w_0))$ is smaller then $(G_n - L(w_0))$ is
larger. $W_n$ has the same property as $C_n$. This is the generalized version of
Remark 25 in regular theory. These properties are very important when we
employ the cross validation loss and WAIC in statistical model selection and
hyperparameter optimization. If a sample consists of independent random
variables, then it automatically follows that $E[C_n] = E[G_{n-1}]$ by the definition
of the cross validation loss. However, in order to derive the variance of
the cross validation error $C_n - L_n(w_0)$, we need mathematical theory.


Lemma 22. Let $\xi(w)$ be the Gaussian process which is uniquely characterized
by eq.(5.20) and eq.(5.21). Then

\[
E_\xi[\langle\sqrt{t}\,\xi(w)\rangle] = E_\xi[\mathrm{Fluc}(\xi)].
\]

Proof. Since $E[G_{n-1}] = E[C_n]$, by the above theorem,

\[
E_{\xi_n}[\langle\sqrt{t}\,\xi_n(w)\rangle] = E_{\xi_n}[\mathrm{Fluc}(\xi_n)] + o(1).
\]

As $n \to \infty$, $\xi_n(w) \to \xi(w)$ in distribution, which completes the lemma. □


Remark 40. Lemma 22 is shown by the relation between expected values
of generalization and cross validation losses, based on the assumption that
$X_1, X_2, \ldots, X_n$ are independent. However, the condition of independent random
variables is not necessary for Lemma 22. In fact, it holds in some cases
when $X_1, X_2, \ldots, X_n$ are not independent. See the next section.


Definition 21. The constant

\[
2\nu = E_\xi[\langle\sqrt{t}\,\xi(w)\rangle] = E_\xi[\mathrm{Fluc}(\xi)]
\]

is called the singular fluctuation.


Theorem 15. Let $\lambda$ and $\nu$ be the real log canonical threshold and the singular
fluctuation. Then the averages of the generalization loss, the training loss,
the cross validation loss, and WAIC are asymptotically given by

\[
E[G_n] = L(w_0) + \frac{\lambda}{n} + o\Bigl(\frac{1}{n}\Bigr), \tag{5.47}
\]
\[
E[T_n] = L(w_0) + \frac{\lambda - 2\nu}{n} + o\Bigl(\frac{1}{n}\Bigr), \tag{5.48}
\]
\[
E[C_n] = L(w_0) + \frac{\lambda}{n} + o\Bigl(\frac{1}{n}\Bigr), \tag{5.49}
\]
\[
E[W_n] = L(w_0) + \frac{\lambda}{n} + o\Bigl(\frac{1}{n}\Bigr). \tag{5.50}
\]

Proof. By Theorem 14 and Lemma 22, we obtain Theorem 15. □
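The substitution behind this proof can be written out term by term. The following is a sketch, not the author's own derivation: it takes expectations of the four expansions in Theorem 14, uses $E[L_n(w_0)] = L(w_0)$, and assumes, as in Lemma 22, that $E[\langle\sqrt{t}\,\xi_n(w)\rangle]$ and $E[\mathrm{Fluc}(\xi_n)]$ both converge to $2\nu$ and that the remainder terms stay $o(1/n)$ after taking expectations.

```latex
\begin{align*}
n\,\bigl(E[G_n]-L(w_0)\bigr) &\to \lambda + \tfrac12(2\nu) - \tfrac12(2\nu) = \lambda,\\
n\,\bigl(E[T_n]-L(w_0)\bigr) &\to \lambda - \tfrac12(2\nu) - \tfrac12(2\nu) = \lambda - 2\nu,\\
n\,\bigl(E[C_n]-L(w_0)\bigr) &\to \lambda - \tfrac12(2\nu) + \tfrac12(2\nu) = \lambda,\\
n\,\bigl(E[W_n]-L(w_0)\bigr) &\to \lambda - \tfrac12(2\nu) + \tfrac12(2\nu) = \lambda.
\end{align*}
```

The two random quantities cancel in $G_n$, $C_n$, and $W_n$ but add up in $T_n$, which is why only the training loss carries the singular fluctuation $\nu$ in its average.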


Definition 22. The functional variance is defined by

\[
V_n = \frac{1}{n}\sum_{i=1}^{n}\bigl\{E_w[(\log p(X_i|w))^2] - E_w[\log p(X_i|w)]^2\bigr\}.
\]

Lemma 23. When $n \to \infty$, $nE[V_n] \to 2\nu$.

Proof. Since $\log p(X_i|w_0)$ is a constant function of a parameter $w$, the random
variable $V_n$ can be represented by

\[
V_n = \frac{1}{n}\sum_{i=1}^{n}\bigl\{E_w[f(X_i,w)^2] - E_w[f(X_i,w)]^2\bigr\}.
\]

The asymptotic equivalence of $nV_n$ and $\mathrm{Fluc}(\xi_n)$ was shown in the part $\mathcal{T}_n''(0)$
in Theorem 14. □


Theorem 16. (Equations of states in Bayesian statistics) Assume that
a statistical model has a standard form. Then

\[
E[G_n] = E[C_n] + o\Bigl(\frac{1}{n}\Bigr), \tag{5.51}
\]
\[
E[G_n] = E[W_n] + o\Bigl(\frac{1}{n}\Bigr). \tag{5.52}
\]

These equations hold even if a true distribution is singular for or unrealizable
by a statistical model.

Proof. This theorem is immediately derived from Theorem 15. □


Remark 41. If the true distribution is regular for a statistical model, then 
the higher order equivalence can be derived. See Chapter 8. 


Remark 42. Both $C_n$ and $W_n$ are estimators of $G_n$. In typical statistical
inferences, the difference between them is very small. If $X_1, X_2, \ldots, X_n$ are independent,
then both eq.(5.51) and eq.(5.52) hold. If $X_1, X_2, \ldots, X_n$ are dependent,
then it is not ensured that eq.(5.51) holds. Under some conditions such as
conditional independence, eq.(5.52) holds. Therefore, even if $X_1, X_2, \ldots, X_n$
are dependent, if eq.(5.52) holds and $C_n$ is asymptotically equivalent to $W_n$,
then eq.(5.51) also holds. See the next section.
5.5 Conditionally Independent Case


In this book, we mainly study a case when a sample $X^n$ consists of random
variables which are independently subject to the same probability distribution.
However, there are several important cases when such an assumption
does not hold. In this section, we study a conditionally independent case. In
this situation, the differences between the cross validation and information
criteria are clarified.

In this section, we study a case when

\[
x^n = (x_1, x_2, \ldots, x_n)
\]

is fixed or may be dependent. The random variables $Y^n = (Y_1, Y_2, \ldots, Y_n)$
are independently subject to a true conditional distribution

\[
\prod_{i=1}^{n} q(y_i|x_i).
\]

Then $(Y_1, Y_2, \ldots, Y_n)$ are independent, but $((x_1, Y_1), (x_2, Y_2), \ldots, (x_n, Y_n))$ are
dependent.


Remark 43. The conditionally independent condition allows the following
cases.

1. The set $(x_1, x_2, \ldots, x_n)$ consists of fixed points.
2. The set $(x_1, x_2, \ldots, x_n)$ is a time sequence.
3. The set $(x_1, x_2, \ldots, x_n)$ is not independent.

Note that, in such cases, prediction and estimation are different from each
other. In general, prediction for a new point $x_{n+1}$ has no meaning, whereas
the estimation of $q(y|x_i)$ using $p(y|x_i, w)$ is well-defined.


We define a statistical model and a prior by

\[
p(y_i|x_i, w), \qquad \varphi(w).
\]

For a given sample $(x^n, Y^n)$, a posterior density is defined by

\[
p(w|x^n, Y^n) = \frac{1}{Z_n}\,\varphi(w)\prod_{i=1}^{n}p(Y_i|x_i, w),
\]

where $Z_n$ is a partition function or the marginal likelihood,

\[
Z_n = \int\varphi(w)\prod_{i=1}^{n}p(Y_i|x_i, w)\,dw,
\]

which can be understood as the estimated probability density function of
$Y^n$. Also we define a Bayesian estimation of the conditional distribution by

\[
p(y|x_i, x^n, Y^n) = E_w[p(y|x_i, w)].
\]

This predictive distribution is not a prediction for a new point but an estimation
of $Y$ at the trained point $x_i$. A prediction for a new point $x$ can be
defined by the same equation,

\[
p(y|x, x^n, Y^n) = E_w[p(y|x, w)];
\]

however, its generalization loss cannot be defined because there does not
exist a probability distribution of $x$. The generalization loss, the cross validation
loss, the training loss, and WAIC are defined by

\[
G_n = -\frac{1}{n}\sum_{i=1}^{n}\int q(y|x_i)\log p(y|x_i, x^n, Y^n)\,dy,
\]
\[
C_n = -\frac{1}{n}\sum_{i=1}^{n}\log p(Y_i|x_i, x^n\setminus x_i, Y^n\setminus Y_i),
\]
\[
T_n = -\frac{1}{n}\sum_{i=1}^{n}\log p(Y_i|x_i, x^n, Y^n),
\]
\[
W_n = T_n + \frac{1}{n}\sum_{i=1}^{n}V_w[\log p(Y_i|x_i, w)].
\]


Also in this case we can construct Bayesian theory for both regular and
standard cases. For example, eq.(3.20), eq.(3.21), and eq.(3.23) in Theorem 3
hold, so that eq.(5.43), eq.(5.44), and eq.(5.46) in Theorem 14
hold. However, there are several points that differ from the independent case.
Because $x^n$ is not subject to the same probability distribution, the average
generalization loss is not equal to the average cross validation loss,

\[
E[G_{n-1}] \ne E[C_n].
\]

In other words, if $x^n$ is dependent, then we lose the relation between the
generalization loss and the cross validation loss. Moreover, the asymptotic
expansion of $C_n$ in eq.(3.22) in Theorem 3 does not hold in general,

\[
C_n = \mathcal{T}_n(-1) \ne \mathcal{T}_n'(0) + \frac{1}{2}\mathcal{T}_n''(0) + o_p\Bigl(\frac{1}{n}\Bigr).
\]

Consequently, eq.(5.45) in Theorem 14 does not hold,

\[
C_n \ne L_n(w_0) + \frac{1}{n}\Bigl(\lambda - \frac{1}{2}\langle\sqrt{t}\,\xi_n(w)\rangle + \frac{1}{2}\mathrm{Fluc}(\xi_n)\Bigr) + o_p\Bigl(\frac{1}{n}\Bigr).
\]
Remark 44. (Numerical problem of importance sampling cross validation)
In conditionally independent cases, the cross validation loss has another
problem. Even in a conditionally independent case, the cross validation loss
$C_n$ satisfies

\[
C_n = \frac{1}{n}\sum_{i=1}^{n}\log E_w\bigl[1/p(Y_i|x_i, w)\bigr],
\]

which can be numerically calculated by using posterior parameters; hence
it is called the importance sampling cross validation. However, if a leverage
sample point $(x_i, Y_i)$ is contained, in other words, if $p(Y_i|x_i, w)$ is very small
for a posterior parameter $w$, then the posterior average $E_w[1/p(Y_i|x_i, w)]$
diverges. Therefore eq.(3.22) in Theorem 3 does not hold in general.


Since $E[G_{n-1}] \ne E[C_n]$, Lemma 22 cannot be derived via the cross validation
loss. However, we can prove the following Lemma 24, by which
eq.(5.47), eq.(5.48), and eq.(5.50) in Theorem 15 and eq.(5.52) in Theorem
16 hold. On the other hand, neither eq.(5.49) in Theorem 15 nor eq.(5.51) in Theorem
16 holds. In other words, the cross validation loss cannot be employed in
conditionally independent cases whereas WAIC can be.

Let us show theoretical structures of the conditionally independent cases.
Let $W_0$ be the set of parameters which minimize

\[
L(w) = -\frac{1}{n}\sum_{i=1}^{n}\int q(y|x_i)\log p(y|x_i, w)\,dy,
\]

and $w_0$ be a parameter contained in $W_0$. Two functions $f(x_i, y, w)$ and $K(w)$
are defined by

\[
f(x_i, y, w) = \log\frac{p(y|x_i, w_0)}{p(y|x_i, w)},
\]
\[
K(w) = \frac{1}{n}\sum_{i=1}^{n}\int q(y|x_i)\,f(x_i, y, w)\,dy.
\]
If a set of a statistical model and a prior is a standard form, there exist
$a(x_i, y, w)$ and $b(w)$ such that

\[
f(x_i, y, w) = w^{k}a(x_i, y, w), \qquad
K(w) = w^{2k}, \qquad
\varphi(w) = |w^{h}|\,b(w),
\]

where both $k \ge 0$ and $h \ge 0$ are multi-indexes and $b(w) > 0$ in a neighborhood
of the origin. A stochastic process $\xi_n(w)$ is defined by

\[
\xi_n(w) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bigl\{w^{k} - a(x_i, Y_i, w)\bigr\}.
\]

Note that $\xi_n(w)$ is an empirical process composed of dependent random
variables in general. It follows that

\[
E[\xi_n(w)] = 0,
\]
\[
E[\xi_n(w)\xi_n(u)] = \frac{1}{n}\sum_{i=1}^{n}E_Y[a(x_i, Y, w)\,a(x_i, Y, u)] - w^{k}u^{k}.
\]

A Gaussian process $\xi(w)$ which satisfies the following conditions is uniquely
determined.

\[
E_\xi[\xi(w)] = 0,
\]
\[
E_\xi[\xi(w)\xi(u)] = \frac{1}{n}\sum_{i=1}^{n}E_Y[a(x_i, Y, w)\,a(x_i, Y, u)] - w^{k}u^{k}.
\]

Then $\langle\sqrt{t}\,\xi_n(w)\rangle$, $\langle\sqrt{t}\,\xi(w)\rangle$, $\mathrm{Fluc}(\xi_n)$, and $\mathrm{Fluc}(\xi)$ are defined in the same
way. We can derive the following lemma without using the cross validation
loss.


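Lemma 24. The same statement as Lemma 22 holds,

\[
E_\xi[\langle\sqrt{t}\,\xi(w)\rangle] = E_\xi[\mathrm{Fluc}(\xi)].
\]

Proof. In this proof we use the notation $E_\xi[\ ] = E[\ ]$. Let us use a decomposition
of the Gaussian process,

\[
\xi(w) = \sum_{j=1}^{\infty}g_j\,\xi_j(w),
\]

where $\{g_j\}$ is a set of random variables which are independently subject to
the standard normal distribution $\mathcal{N}(0,1)$. If $K(w) = 0$, then $w^{k} = 0$ and

\[
E[\xi(w)\xi(v)] = \frac{1}{n}\sum_{i=1}^{n}E_Y[a(x_i, Y, w)\,a(x_i, Y, v)] = \sum_{j=1}^{\infty}\xi_j(w)\,\xi_j(v). \tag{5.53}
\]

A standard normal random variable satisfies

\[
E[g_j F(g_j)] = E\Bigl[\frac{\partial}{\partial g_j}F(g_j)\Bigr]
\]

for an arbitrary integrable function $F(\ )$. Let $S$ be the integration operator
defined by

\[
S[\ \cdot\ ] = \int dw\,D(w)\int_0^\infty dt\;t^{\lambda-1}e^{-t}\,[\ \cdot\ ].
\]

Then

\[
\begin{aligned}
E[\langle\sqrt{t}\,\xi\rangle]
&= E\Bigl[\frac{S[\sqrt{t}\,\xi\exp(\sqrt{t}\,\xi)]}{S[\exp(\sqrt{t}\,\xi)]}\Bigr]
 = \sum_{j=1}^{\infty}E\Bigl[g_j\,\frac{S[\sqrt{t}\,\xi_j\exp(\sqrt{t}\,\xi)]}{S[\exp(\sqrt{t}\,\xi)]}\Bigr] \\
&= \sum_{j=1}^{\infty}E\Bigl[\frac{\partial}{\partial g_j}\Bigl(\frac{S[\sqrt{t}\,\xi_j\exp(\sqrt{t}\,\xi)]}{S[\exp(\sqrt{t}\,\xi)]}\Bigr)\Bigr] \\
&= \sum_{j=1}^{\infty}E\Bigl[\frac{S[t\,\xi_j^2\exp(\sqrt{t}\,\xi)]}{S[\exp(\sqrt{t}\,\xi)]}
 - \Bigl(\frac{S[\sqrt{t}\,\xi_j\exp(\sqrt{t}\,\xi)]}{S[\exp(\sqrt{t}\,\xi)]}\Bigr)^2\Bigr].
\end{aligned}
\]

By eq.(5.53), the last expression is equal to $E[\mathrm{Fluc}(\xi)]$, which completes the lemma. □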


Remark 45. For a finite $n$, if the convergence in distribution $\xi_n(w) \to \xi(w)$
holds, then

\[
E_{\xi_n}[\langle\sqrt{t}\,\xi_n(w)\rangle] = E_{\xi_n}[\mathrm{Fluc}(\xi_n)] + o(1).
\]

Therefore, the above equation also holds asymptotically. In the proof of
Lemma 24, the partial integration over the functional space is effectively
employed. Therefore, we understand that in independent cases, the cross
validation is mathematically equivalent to the partial integration over the
functional space.

Example 38. Let us illustrate an example in which the convergence $\xi_n(w) \to \xi(w)$
holds in a conditionally independent case. Let $y \in \mathbb{R}$, $x \in \mathbb{R}^N$, and
$w \in \mathbb{R}^d$. We study a case when a statistical model $p(y|x, w, s)$ is given by

\[
p(y|x, w, s) = \sqrt{\frac{s}{2\pi}}\exp\Bigl(-\frac{s}{2}(y - F(x, w))^2\Bigr),
\]

where $F(x, w)$ is a function from $\mathbb{R}^N \times \mathbb{R}^d$ to $\mathbb{R}$. Assume that $x^n$ is a set of
fixed points and $Y^n$ is taken from $p(y|x_i, w_0, s_0)$ for some true $w_0$ and $s_0$.
Then

\[
f(x_i, y, w, s) = \frac{s}{2}(y - F(x_i, w))^2 - \frac{1}{2}\log s
 - \frac{s_0}{2}(y - F(x_i, w_0))^2 + \frac{1}{2}\log s_0,
\]

and

\[
K(w, s) = \frac{1}{n}\sum_{i=1}^{n}\int p(y|x_i, w_0, s_0)\,f(x_i, y, w, s)\,dy.
\]

Hence if a statistical model is a standard form,

\[
w^{k}\xi_n(w, s) = \frac{s_0 - s}{2\sqrt{n}}\sum_{i=1}^{n}\bigl\{(Y_i - F(x_i, w_0))^2 - 1/s_0\bigr\}
 + \frac{s}{\sqrt{n}}\sum_{i=1}^{n}(Y_i - F(x_i, w_0))\,(F(x_i, w) - F(x_i, w_0)),
\]

which converges to a Gaussian process as $n \to \infty$, because $\{Y_i - F(x_i, w_0)\}$
is a set of independent random variables which are subject to the same
probability distribution.


Example 39. (Influential observation) A statistical model and a prior are
defined by

\[
p(y|x, a, s) = \sqrt{\frac{s}{2\pi}}\exp\Bigl(-\frac{s}{2}(y - ax)^2\Bigr),
\]
\[
\varphi(a, s) \propto s\exp\Bigl(-\frac{s}{2}(\rho + \mu a^2)\Bigr),
\]

where hyperparameters are set as $\mu = \rho = 0.01$. The true conditional
density is defined by $q(y|x) = p(y|x, a_0, s_0)$, where $a_0 = 0.2$ and $s_0 = 100$.
The set of inputs is given by

\[
x_i = 0.1 \times i \quad (i = 1, 2, \ldots, n-1), \qquad x_n = R,
\]

where $n = 10$ and $Y^n$ are independently taken from $q(y|x_i)$. The inputs
satisfy $0 < x_i < 1$ for $i = 1, 2, \ldots, n-1$, whereas the last input $x_n$ is
a leverage sample point because it is set as $R = 1, 2, 3, 4, 5$. Figure 5.2
shows the generalization, the cross validation, and WAIC errors for given $R$,
respectively. Here the generalization error is measured not by the average over
any probability distribution but by the empirical average over $x^n$. As the value
$R$ becomes larger, the effect of the leverage sample point becomes larger, so
that the average of the cross validation error departs from that of
the generalization error, and its variance becomes larger.

Figure 5.2: Influential observation. The generalization, cross validation,
and WAIC errors are compared for the case when a leverage sample point
is contained. The horizontal axis shows the place of the leverage sample
point whereas other sample points are in the interval [0,1]. If the leverage
sample point is far from others, then the variance of the cross validation
error becomes large.


Example 40. (High dimensional case) Let us study a statistical model of
$y \in \mathbb{R}$ for a given $x \in \mathbb{R}^d$ with a parameter $(a, s)$ ($a \in \mathbb{R}^d$, $s > 0$),

\[
p(y|x, a, s) = \sqrt{\frac{s}{2\pi}}\exp\Bigl(-\frac{s}{2}(y - a\cdot x)^2\Bigr),
\]

and a prior

\[
\varphi(a, s) \propto s^{r}\exp\Bigl(-\frac{s}{2}(\rho + \mu\|a\|^2)\Bigr),
\]

where $r = d/2$, $\rho = 0.005$, and $\mu = 0.005$ are hyperparameters. In this
example, we studied cases when d = 20, or 50, and n = 100. A true density
was set as $q(y|x) = p(y|x, a_0, s_0)$, where $a_0 = (0.5, 0.5, \ldots, 0.5)$ and $s_0 = 100$.
Then the posterior density of $(s, a)$ is given by

\[
p(s|X^n, Y^n) \propto s^{\,r+(n-d)/2}\exp(-Cs/2),
\]
\[
p(a|s, X^n, Y^n) = \mathcal{N}_d(B, A^{-1}/s),
\]

where $\mathcal{N}_d(b, S)$ is the $d$ dimensional normal distribution whose average is $b$
and covariance matrix is $S$ and

\[
A = \sum_{i=1}^{n}X_iX_i^{T} + \mu I, \qquad
B = A^{-1}\sum_{i=1}^{n}Y_iX_i, \qquad
C = -\mathrm{tr}(ABB^{T}) + \sum_{i=1}^{n}Y_i^2 + \rho.
\]

Therefore by using the simple Monte Carlo method, we approximated the
generalization loss $G_n$, the cross validation loss $C_n$, and WAIC $W_n$.

(1) d = 20, n = 100. Firstly, let $X^n$ consist of independent random variables
which are subject to the normal distribution $\mathcal{N}_d(0, I)$ where $I$ is the identity
matrix. Then $G_n$ is measured by the average over this distribution. The
experimental averages and standard deviations for 1000 independent trials
were

  C_n − S_n = 0.130, 0.034,
  W_n − S_n = 0.123, 0.033,
  G_n − S   = 0.129, 0.030,

and the estimated errors were

  E[ |G_n − S − (C_n − S_n)| ] = 0.0462,
  E[ |G_n − S − (W_n − S_n)| ] = 0.0456.

(2) d = 20, n = 100. Secondly, $X^n$ was generated from $\mathcal{N}_d(0, I)$ and then
fixed, and $G_n$ was defined by the empirical mean over the fixed $X^n$. The experimental
averages and standard deviations of $C_n - S_n$ and $W_n - S_n$ for 1000
independent trials were the same as in (1), whereas those of $G_n - S$ were

  G_n − S = 0.108, 0.024,

and the estimated errors were

  E[ |G_n − S − (C_n − S_n)| ] = 0.047,
  E[ |G_n − S − (W_n − S_n)| ] = 0.044.

(3) d = 40, n = 100. The experimental averages and standard deviations
for 1000 independent trials were

  G_n − S   = 0.425, 0.030,
  C_n − S_n = 0.428, 0.047,
  W_n − S_n = 0.402, 0.047,

and the estimated errors were

  E[ |G_n − S − (C_n − S_n)| ] = 0.056,
  E[ |G_n − S − (W_n − S_n)| ] = 0.058.

(4) d = 40, n = 100. Secondly, $X^n$ was generated from $\mathcal{N}_d(0, I)$ and then
fixed. The experimental averages and standard deviations of $C_n - S_n$ and
$W_n - S_n$ for 1000 independent trials were the same as in (3), whereas those of
$G_n - S$ were

  G_n − S = 0.336, 0.021,

and the estimated errors were

  E[ |G_n − S − (C_n − S_n)| ] = 0.098,
  E[ |G_n − S − (W_n − S_n)| ] = 0.078.

If $d/n$ is not so large, $W_n - S_n$ is the better estimator of $G_n - S$ than $C_n - S_n$,
for both independent and fixed $x^n$. If $d/n$ is larger, then statistical estimation
is not accurate. If $x^n$ is independent, $C_n - S_n$ is the better estimator of
$G_n - S$ than $W_n - S_n$. If $x^n$ is fixed, $W_n - S_n$ is the better estimator of
$G_n - S$ than $C_n - S_n$. Our recommendation in practical problems is as
follows. If either the cross validation loss or WAIC can be calculated by the
Markov Chain Monte Carlo method, then the other can also be calculated
in almost the same computational time. Therefore, we recommend that
both of them be calculated and compared. If they are different, $x^n$
may be dependent or may contain a leverage sample point.
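The recommendation to compute both quantities from the same posterior sample can be sketched in a few lines. The code below is a minimal illustration, not the procedure used for the experiments above: it assumes the conjugate posterior of Example 40 specialized to $d = 1$ (so $A$, $B$, $C$ are scalars), uses $W_n = T_n + V_n$ with the functional variance, and uses the importance sampling form $C_n = \frac{1}{n}\sum_i \log E_w[1/p(Y_i|x_i,w)]$ of Remark 44. All function names are hypothetical.

```python
import math
import random

def log_mean_exp(values):
    """Numerically stable log of the arithmetic mean of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def sample_posterior(xs, ys, rho=0.005, mu=0.005, draws=2000, seed=0):
    """Draw (a, s) from the conjugate posterior of Example 40 with d = 1 (assumed)."""
    rng = random.Random(seed)
    n, d = len(xs), 1
    r = d / 2.0
    A = sum(x * x for x in xs) + mu
    B = sum(x * y for x, y in zip(xs, ys)) / A
    C = -A * B * B + sum(y * y for y in ys) + rho
    shape = r + (n - d) / 2.0 + 1.0            # p(s) is proportional to s^(shape-1) exp(-C s / 2)
    samples = []
    for _ in range(draws):
        s = rng.gammavariate(shape, 2.0 / C)   # second argument is the scale 2 / C
        a = rng.gauss(B, math.sqrt(1.0 / (A * s)))
        samples.append((a, s))
    return samples

def log_lik(y, x, a, s):
    """log p(y | x, a, s) for the normal regression model."""
    return 0.5 * math.log(s / (2.0 * math.pi)) - 0.5 * s * (y - a * x) ** 2

def cv_and_waic(xs, ys, samples):
    """Return (C_n, W_n) estimated from posterior draws."""
    n = len(xs)
    t_sum = v_sum = c_sum = 0.0
    for x, y in zip(xs, ys):
        ll = [log_lik(y, x, a, s) for a, s in samples]
        t_sum -= log_mean_exp(ll)                       # training loss term
        mean = sum(ll) / len(ll)
        v_sum += sum((v - mean) ** 2 for v in ll) / len(ll)   # functional variance term
        c_sum += log_mean_exp([-v for v in ll])         # log E_w[1 / p(Y_i | x_i, w)]
    return c_sum / n, (t_sum + v_sum) / n

rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)]
ys = [0.5 * x + rng.gauss(0.0, 0.1) for x in xs]        # a_0 = 0.5, s_0 = 100
c_n, w_n = cv_and_waic(xs, ys, sample_posterior(xs, ys))
print(c_n, w_n)
```

With inputs drawn as above, the two values are expected to be close; replacing one input by a distant point is a simple way to watch the importance sampling average in `c_sum` become unstable, as described in Remark 44.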
Example 41. (Time sequence) In time series analysis, we sometimes use a
statistical model

\[
Y_i = aY_{i-1} + bY_{i-2} + cY_{i-3} + \text{noise}.
\]

This model is represented by

\[
p(Y_i|Y_{i-1}, Y_{i-2}, Y_{i-3}, a, b, c),
\]

which is equivalent to

\[
p(Y_i|x_i, a, b, c),
\]

where

\[
x_i = (Y_{i-1}, Y_{i-2}, Y_{i-3}) \in \mathbb{R}^3.
\]

The sample point $x_i$ depends on $Y_j$ ($j \ne i$). However, this model is equivalent
to the model in which $\{Y_i\}$ are independent under the condition that $x^n$
are given. If we adopt the assumption that a true probability distribution
satisfies the same conditional independence as the statistical model, WAIC
can be applied to the evaluation of estimation accuracy at the set of empirical
points $x^n$. Moreover, if the cross validation loss has almost the same
value as WAIC, then the cross validation loss can also be employed as an
approximate value of WAIC.
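The rewriting of the time series model as a conditional model amounts to pairing each observation with its three predecessors. A minimal sketch of that bookkeeping follows; the function name `lag_embed` and the choice to drop the first three observations (which have no complete history) are assumptions made for illustration.

```python
def lag_embed(series, order=3):
    """Return the list of pairs (x_i, Y_i) with x_i = (Y_{i-1}, ..., Y_{i-order})."""
    pairs = []
    for i in range(order, len(series)):
        # Most recent value first, matching x_i = (Y_{i-1}, Y_{i-2}, Y_{i-3}).
        x_i = tuple(series[i - j] for j in range(1, order + 1))
        pairs.append((x_i, series[i]))
    return pairs

print(lag_embed([1.0, 2.0, 3.0, 4.0, 5.0]))
```

The resulting pairs can then be passed to any routine that evaluates the conditional model $p(Y_i|x_i, a, b, c)$ pointwise, which is all that WAIC requires.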


5.6 Problems 


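1. Let $k$ be a positive integer and $\lambda = 1/k$. Prove the following equations.

\[
\int_0^1\exp(-nx^{k})\,dx = \frac{\lambda}{n^{\lambda}}\Bigl(\int_0^{n}\exp(-y)\,y^{\lambda-1}\,dy\Bigr),
\]
\[
\int_0^1\delta(t - x^{k})\,dx = \lambda\,t^{\lambda-1},
\]
\[
\int_0^1 x^{kz}\,dx = \frac{\lambda}{z+\lambda},
\]

where $n > 0$, $0 < t < 1$, and $\mathrm{Re}(z) > -\lambda$.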


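2. Let $k$ be a positive integer and $\lambda = 1/k$. Assume that $\varphi(x)$ is a $C^1$ class
function on $(-\epsilon, 1]$, where $\epsilon > 0$. Then prove that

\[
\int_0^1\exp(-nx^{k})\,\varphi(x)\,dx = \frac{\lambda\,\Gamma(\lambda)}{n^{\lambda}}\,\varphi(0) + o(1/n^{\lambda}),
\]
\[
\int_0^1\delta(t - x^{k})\,\varphi(x)\,dx = \lambda\,t^{\lambda-1}\varphi(0) + o(t^{\lambda-1}),
\]
\[
\int_0^1 x^{kz}\,\varphi(x)\,dx = \frac{\lambda}{z+\lambda}\,\varphi(0) + g(z),
\]

as $n \to \infty$, $t \to +0$, and $g(z)$ is an analytic function on $\mathrm{Re}(z) > -\lambda$ which
can be analytically continued to an analytic function on $\mathrm{Re}(z) > -2\lambda$.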
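Before attempting the proofs, the first asymptotic formula can be checked numerically. The sketch below is only a sanity check under the assumption $\varphi(x) = 1$, in which case the leading term is $\lambda\Gamma(\lambda)/n^{\lambda}$ with $\lambda = 1/k$; the function names are hypothetical and the integral is approximated by a plain midpoint rule.

```python
import math

def laplace_integral(n, k, steps=200_000):
    """Midpoint-rule approximation of the integral of exp(-n * x**k) over [0, 1]."""
    h = 1.0 / steps
    return h * sum(math.exp(-n * ((i + 0.5) * h) ** k) for i in range(steps))

def leading_term(n, k):
    """lambda * Gamma(lambda) / n**lambda with lambda = 1/k."""
    lam = 1.0 / k
    return lam * math.gamma(lam) / n ** lam

for k, n in [(2, 400.0), (3, 1000.0)]:
    print(k, n, laplace_integral(n, k), leading_term(n, k))
```

For large $n$ the two columns agree closely, because the part of $\int_0^{n} e^{-y}y^{\lambda-1}dy$ missing from $\Gamma(\lambda)$ is exponentially small.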


3. Let $f(x,y)$ and $g(x,y)$ be polynomials of $(x,y)$ which satisfy $f(0,0) = 0$.
Let $U$ be an open set which contains $(0,0)$ and $U^* = U\setminus\{(x,y);\ f(x,y) = 0\}$.
(1) Make an example of a set $f(x,y)$ and $g(x,y)$ which satisfies

\[
\sup_{(x,y)\in U^*}\Bigl|\frac{g(x,y)}{f(x,y)}\Bigr| < \infty,
\]

when $g(x,y)/f(x,y)$ is not a polynomial.
(2) Assume that $f(x,y) = x^2y^2$. Prove that if

\[
\sup_{(x,y)\in U^*}\Bigl|\frac{g(x,y)}{f(x,y)}\Bigr| < \infty,
\]

then $g(x,y)/f(x,y)$ is a polynomial.
Explain the mathematical difference between (1) and (2).


4. A neural network is defined by a conditional density of $x, y \in \mathbb{R}$,

\[
p(y|x, a, b) = \frac{1}{\sqrt{2\pi}}\exp\Bigl(-\frac{1}{2}(y - a\tanh(bx))^2\Bigr).
\]

We study a case when a probability density of $x$ is the uniform distribution
on $[-2, 2]$. A true conditional density is set as $p(y|x, 0, 0)$, and the uniform
prior for $(a,b)$ on $[-1,1]\times[-1,1]$ is adopted. Then the true distribution
is realizable by and singular for a statistical model. Figure 5.3 shows the
posterior distributions for six independent sets of $(X^n, Y^n)$, where $n = 100$.
Even if the true parameter is $(a_0, b_0) = (0.1, 0.1)$, the posterior distributions
have almost the same shapes as in the case $(a_0, b_0) = (0,0)$. In statistical model
selection and hypothesis testing, we often have to determine $(a_0, b_0) = (0,0)$
or $(a_0, b_0) \ne (0,0)$ based on a sample. Discuss whether we can apply regular
statistical theory to neural networks or not.

Figure 5.3: Fluctuation of posterior distribution in neural network. Posterior
distributions of six independent sets of $X^n$ are shown for the case
$n = 100$. The set of the true parameters is $\{(a,b);\ ab = 0\}$.


5. The statistical model, the true density, and the prior are set as in the
above neural network. Hence the posterior distribution cannot be approximated
by any normal distribution. Note that $\lambda$ and $m$, which depend on
the true distribution, are $\lambda = 1/2$ and $m = 2$. Let $(\hat{a}, \hat{b})$ be the maximum a
posteriori estimator. Since the uniform prior of $(a,b)$ on $[-1,1]\times[-1,1]$ is
adopted, it is equal to the maximum likelihood estimator in this case. The
log loss function is

\[
L_n(a, b) = -\frac{1}{n}\sum_{i=1}^{n}\log p(Y_i|X_i, a, b).
\]

The free energy $F_n$ is equal to

\[
F_n = -\log\iint\exp(-nL_n(a,b))\,\varphi(a,b)\,da\,db.
\]

In general, it is difficult to calculate the integration over the parameter set
in $F_n$. However, in this case, it can be approximated by the Riemann sum
because the dimension of the parameter space is small (2). Its estimators,
BIC, BIC using RLCT, and WBIC are given by

\[
\mathrm{BIC} = nL_n(\hat{a}, \hat{b}) + (d/2)\log n,
\]
\[
\mathrm{BIC}_{\mathrm{RLCT}} = nL_n(\hat{a}, \hat{b}) + \lambda\log n - (m-1)\log\log n,
\]
\[
\mathrm{WBIC} = E_w^{(\beta)}[\,nL_n(a,b)\,],
\]

where $d = 2$ and $E_w^{(\beta)}[\ ]$ shows the expectation value by the posterior distribution
with the inverse temperature $\beta = 1/\log n$. The empirical entropy $S_n$
is equal to $L_n(0,0)$. Figure 5.4 shows $F_n - nS_n$, $\mathrm{BIC} - nS_n$, $\mathrm{BIC}_{\mathrm{RLCT}} - nS_n$, and
$\mathrm{WBIC} - nS_n$. Both BIC using RLCT and WBIC can approximate the free
energy, whereas BIC cannot. Discuss the difference of BIC, BIC using RLCT,
and WBIC from the viewpoint of estimators of the free energy for $n \to \infty$.

Figure 5.4: Free energy and its estimators in neural network. Free energy,
BIC, BIC using RLCT, and WBIC are compared in a simple neural network.
BIC is larger than the free energy. Both BIC using RLCT and WBIC can
approximate the free energy.
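The Riemann sum mentioned in this problem can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code behind Figure 5.4: the model is taken to be $p(y|x,a,b) = (2\pi)^{-1/2}\exp(-(y - a\tanh(bx))^2/2)$ with the uniform prior $1/4$ on $[-1,1]^2$, the integral is replaced by a midpoint grid, and the maximum likelihood estimator is approximated by the best grid point. The function name `neural_net_criteria` is hypothetical.

```python
import math
import random

def neural_net_criteria(xs, ys, grid=100):
    """Grid approximations of F_n, BIC, BIC with RLCT, and WBIC for y = a*tanh(b*x) + noise."""
    n = len(xs)
    beta = 1.0 / math.log(n)                       # inverse temperature of WBIC
    const = 0.5 * n * math.log(2.0 * math.pi)
    sum_yy = sum(y * y for y in ys)
    centers = [-1.0 + (i + 0.5) * 2.0 / grid for i in range(grid)]
    losses = []                                    # values of n * L_n(a, b) on the grid
    for b in centers:
        ts = [math.tanh(b * x) for x in xs]
        sum_yt = sum(y * t for y, t in zip(ys, ts))
        sum_tt = sum(t * t for t in ts)
        for a in centers:
            losses.append(const + 0.5 * (sum_yy - 2.0 * a * sum_yt + a * a * sum_tt))
    n_loss_min = min(losses)                       # grid stand-in for n * L_n at the MLE
    weight = 1.0 / len(losses)                     # prior 1/4 times cell area (2/grid)^2
    z = sum(math.exp(-(v - n_loss_min)) for v in losses) * weight
    tempered = [math.exp(-beta * (v - n_loss_min)) for v in losses]
    lam, m, d = 0.5, 2, 2
    return {
        "n_loss_min": n_loss_min,
        "free_energy": n_loss_min - math.log(z),
        "bic": n_loss_min + 0.5 * d * math.log(n),
        "bic_rlct": n_loss_min + lam * math.log(n) - (m - 1) * math.log(math.log(n)),
        "wbic": sum(v * t for v, t in zip(losses, tempered)) / sum(tempered),
    }

rng = random.Random(0)
xs = [rng.uniform(-2.0, 2.0) for _ in range(100)]
ys = [rng.gauss(0.0, 1.0) for _ in xs]             # true parameter (a_0, b_0) = (0, 0)
print(neural_net_criteria(xs, ys))
```

Subtracting $nS_n = nL_n(0,0)$ from each returned value and averaging over many independent samples gives quantities comparable to those described for Figure 5.4; the grid size trades accuracy for speed.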
6. The statistical model, the true density, and the prior are set as in the
above neural network. Figure 5.5 shows $n(G_n - S)$, $n(C_n - S_n)$, $n(W_n - S_n)$,
the theoretical value by standard theory, and that by regular theory. Regular
theory cannot be employed in this case. The generalization error $(G_n - S)$
and the cross validation error $(C_n - S_n)$ have the asymptotically inverse
correlation,

\[
n(G_n - S) + n(C_n - S_n) = 2\lambda + o_p(1),
\]

where $\lambda = 1/2$ in this case. Hence it needs many sets of $(X^n, Y^n)$ to numerically
show

\[
E[G_n - S] = E[C_n - S_n] + o(1/n).
\]

In this experiment, the results by 1000 independent trials are shown. Discuss
the difference of standard theory and regular theory for $n \to \infty$.

Figure 5.5: Generalization error and its estimators in neural network. The
generalization error, the cross validation error, the WAIC error, the theoretical
value, and the regular theoretical value are compared. The generalization
error can be approximated by the cross validation and WAIC, but not by
the regular theory.


Chapter 6


General Posterior 
Distribution 


In the previous chapter, we introduced a standard form of a statistical model
and a prior, based on which mathematical laws of the free energy and the
generalization loss were proved. In this chapter, we explain that many models
and priors can be made into standard forms by using the algebraic geometric
transform. Then the posterior distribution is represented as a finite
mixture of locally standard forms,

\[
p(w|X^n) = \sum\ \text{Standard form}.
\]

As a result, the same theorems of the previous chapter also hold in many statistical
models and priors. Also we show the difference between the Bayesian
and the maximum a posteriori methods. This chapter consists of the following
sections.

(1) In Bayesian estimation, the set of parameters can be understood as a 
union of local parameter sets. 

(2) Resolution theorem in algebraic geometry makes an arbitrary statistical 
model a locally standard form. 

(3) General theory of Bayesian statistics is established. 

(4) The generalization losses of the maximum likelihood and a posteriori 
methods are derived. 


6.1 Bayesian Decomposition 


In Bayesian statistics, the posterior distribution can be decomposed as a sum
of local distributions. Let $(p(x|w), \varphi(w))$ be a pair of a statistical model and
a prior.

Figure 6.1: Division of parameter set. Bayesian posterior distribution can
be understood as the mixture of several distributions.

A decomposition of a prior $\varphi(w)$ is
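\[
\varphi(w) = \sum_j\varphi_j(w),
\]

where $\varphi_j(w) \ge 0$. Then the partition function or the marginal likelihood is
given by

\[
Z_n = \sum_j\int\varphi_j(w)\prod_{i=1}^{n}p(X_i|w)\,dw.
\]

The function $X^n \mapsto Z_n$ is a probability density of $X^n$ according to the pair
of the statistical model $p(x|w)$ and a prior $\varphi(w)$, hence

\[
Z_n(j) = \int\varphi_j(w)\prod_{i=1}^{n}p(X_i|w)\,dw
\]

defines, after normalization by $Z_n$, a probability distribution on the pairs $\{p(x|w), \varphi_j(w)\}$. The posterior
average of an arbitrary function $f(w)$ is also given by

\[
E_w[f(w)] =
\frac{\displaystyle\sum_j\int f(w)\,\varphi_j(w)\prod_{i=1}^{n}p(X_i|w)\,dw}
     {\displaystyle\sum_j\int\varphi_j(w)\prod_{i=1}^{n}p(X_i|w)\,dw}.
\]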
j i=l 
Therefore Bayesian estimation can be studied from the viewpoint of the local
parameter sets.
The support of $\varphi_j(w)$ is defined as the closure of the nonzero set of $\varphi_j(w)$,

\[
\mathrm{supp}\,\varphi_j = \overline{\{w \in W;\ \varphi_j(w) > 0\}}.
\]

Let $W_0$ be the set of all parameters that attain the minimum of the average
log loss function $L(w)$,

\[
W_0 = \{w \in W;\ L(w) = \min_{w'}L(w')\}.
\]

If $\{\mathrm{supp}\,\varphi_j\}\cap W_0$ is the empty set, then $Z_n(j)$ converges to zero faster
than the others. If $\{\mathrm{supp}\,\varphi_j\}\cap W_0$ is not the empty set, and if a statistical
model and a prior have a standard form in each local subset, then by using
the local real log canonical threshold $\lambda_j$ and its multiplicity $m_j$, the local
partition function can be rewritten as

\[
Z_n(j) = c_j\,\frac{(\log n)^{m_j-1}}{n^{\lambda_j}}\prod_{i=1}^{n}p(X_i|w_0),
\]

where $c_j$ is a constant order random variable and $w_0 \in W_0$. Let us define

\[
\lambda = \min\{\lambda_j;\ j\}, \qquad
m = \max\{m_j;\ \lambda_j = \lambda\},
\]

that is, $m$ is the largest multiplicity $m_j$ among the indices $j$ that attain $\lambda = \lambda_j$.
Then asymptotically

\[
Z_n \cong \Bigl(\sum_j{}^{\!*}\,c_j\Bigr)\frac{(\log n)^{m-1}}{n^{\lambda}}\prod_{i=1}^{n}p(X_i|w_0),
\]

where $\sum_j^{*}$ is the summation over $\{j;\ \lambda = \lambda_j,\ m_j = m\}$, so that the
free energy or the minus log marginal likelihood has the same asymptotic
form as the theorems of previous chapters.
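The selection rule for the dominant local terms can be written down directly. The sketch below is a small illustration with hypothetical names: given the local pairs $(\lambda_j, m_j)$, it returns the global $\lambda$, the global $m$, and the indices $j$ whose local partition functions have the leading order $(\log n)^{m-1}/n^{\lambda}$.

```python
import math

def global_rlct(local_pairs, tol=1e-12):
    """Return (lam, m, dominant) for a list of local pairs (lambda_j, m_j)."""
    lam = min(l for l, _ in local_pairs)
    # Multiplicities of the local terms that attain the smallest lambda_j.
    m = max(mj for l, mj in local_pairs if abs(l - lam) <= tol)
    dominant = [j for j, (l, mj) in enumerate(local_pairs)
                if abs(l - lam) <= tol and mj == m]
    return lam, m, dominant

def log_order(lam, m, n):
    """Logarithm of the rate (log n)^(m-1) / n^lam."""
    return (m - 1) * math.log(math.log(n)) - lam * math.log(n)

pairs = [(1.0, 1), (0.5, 1), (0.5, 2)]
lam, m, dominant = global_rlct(pairs)
print(lam, m, dominant)
print([log_order(l, mj, 10_000) for l, mj in pairs])
```

For large $n$ the term with the smallest $\lambda_j$ and, among those, the largest $m_j$ has the largest rate, which is why only those indices survive in the asymptotic form of $Z_n$.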


Example 42. In Figure 6.1, the circle shows the true distribution and the 
curved line is a set of statistical models. The posterior distribution is made 
of local parameters (1), (2), and (3). In such a case, the parameter set can be 
divided into three parts, and the posterior distribution can be represented 
by their summation. 
Figure 6.2: Artificial examples of posterior distributions. The set of true parameters
is $b(a-1)(a+1) = 0$. The posterior distribution can be understood
as a mixture of local standard forms.

Example 43. Let a statistical model and a prior be

\[
p(y|x, a, b) = \frac{1}{\sqrt{2\pi}}\exp\Bigl(-\frac{1}{2}\bigl\{y - b\tanh((a-1)x)\tanh((a+1)x)\bigr\}^2\Bigr),
\]
\[
\varphi(a, b) =
\begin{cases}
1/8 & (|a| \le 2,\ |b| \le 1) \\
0 & (\text{otherwise}),
\end{cases}
\]

where the set of all parameters is

\[
W = \{(a,b);\ |a| \le 2,\ |b| \le 1\}.
\]

Also let $(X^n, Y^n)$ be a set of random variables which are independently
taken from a true distribution $q(x)\,p(y|x, 0, 0)$, where $q(x)$ is the uniform
distribution on $[-2, 2]$. Then the set of true parameters is $\{(a,b);\ b(a-1)(a+1) = 0\}$.
Figure 6.2 shows four posterior distributions for different
independent sets $(X^n, Y^n)$. By dividing the parameter set as

\[
W = \{(a,b) \in W;\ a \le 0\}\cup\{(a,b) \in W;\ a \ge 0\},
\]

the posterior distribution is represented as a mixture of standard forms. To
each distribution, we can apply the theory in the previous chapter because
$b(a-1)$ and $b(a+1)$ can be made standard form by $a_1 = a-1$ and $a_2 = a+1$.

6.2 Resolution of Singularities 


Even if the posterior distribution cannot be approximated by any normal 
distribution, there exists division of parameter set such that the average log 
density ratio function can be normal crossing in each local parameter set. 
The resolution theorem in algebraic geometry is the mathematical base for 
statistical analysis of general Bayesian statistics. 


Theorem 17 (Hironaka theorem). Let $W$ be a compact set which is the
closure of an open set in $\mathbb{R}^d$. Assume that $K(w) \ge 0$ is a nonzero analytic
function on $W$ and that the set $\{w \in W;\ K(w) = 0\}$ is not empty. Then
there exist $\epsilon > 0$, $\{W_\ell;\ W_\ell \subset W\}$, and $\{U_\ell;\ U_\ell \subset \mathbb{R}^d\}$ which satisfy

\[
\{w \in W;\ K(w) \le \epsilon\} = \bigcup_\ell W_\ell,
\]

and, in each pair $W_\ell$ and $U_\ell$, there exists an analytic map $g : U_\ell \to W_\ell$
which satisfies

\[
K(g(u)) = u^{2k}, \qquad
\varphi(g(u))\,|g'(u)| = b(u)\,|u^{h}|,
\]

where $|g'(u)|$ is the absolute value of the determinant of the Jacobian matrix
of $w = g(u)$,

\[
|g'(u)| = \Bigl|\det\Bigl(\frac{\partial w_i}{\partial u_j}\Bigr)\Bigr|,
\]

and $k, h$ ($k \ge 0$, $h \ge 0$) are $d$-dimensional multi-indexes, and $b(u) > 0$.


Example 44. Let a parameter set be $[-0.5, 0.5]^2$. A statistical model and a
prior are defined by

\[
p(y|x, a, b) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Bigl(-\frac{1}{2\sigma^2}\bigl(y - (a^4 - a^2b + b^3)x\bigr)^2\Bigr),
\]
\[
\varphi(a, b) = 1,
\]

where $\sigma = 0.01$. Let a true distribution be $q(y|x) = p(y|x, 0, 0)$ and $n = 100$.
The average log density ratio function is

\[
K(a, b) = \frac{1}{2\sigma^2}(a^4 - a^2b + b^3)^2.
\]

Hence the set of true parameters is

\[
a^4 - a^2b + b^3 = 0.
\]

We define a blowing-up by

\[
a = a', \qquad b = 3a'b'.
\]

It follows that

\[
a^4 - a^2b + b^3 = a'^3(a' - 3b' + 27b'^3).
\]

Figure 6.3 shows the set of true parameters on the $(a,b)$ plane, its blown-up on the
$(a',b')$ plane, the posterior distribution of $(a,b)$, and the blown-up posterior
distribution of $(a',b')$. Note that the determinant of the Jacobian matrix of this
blowing-up is

\[
|g'(a', b')| = 3|a'|,
\]

hence the blown-up posterior distribution is defined using this equation. On
the $(a,b)$ plane, the origin is not a normal crossing singularity, whereas, on
the $(a',b')$ plane, $(0,0)$, $(0,1/3)$, and $(0,-1/3)$ are normal crossing singularities.
Hence on the $(a',b')$ plane, the posterior distribution is a mixture of
standard forms.
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Figure 6.3: Example of resolution theorem. The set of true parameters on 
(a,b), its blown-up on (a’,b’), the posterior distribution of (a,b) and the 
blown-up posterior distribution of (a’,b’). By using the resolution of singu- 
larities, any singularity can be understood as an image of normal crossing 
singularities. By using resoluion theorem, most statistical models can be- 
come standard forms, to which the theorems in the previous chapter can be 
applied. 
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Definition 23. In the resolution theorem, g(u), k, h, and b(w) depend 
on U,, although such dependence is not explicitly represented because of 
simple description. If such dependence is necessary for description, then 
representations g¢(w), ke, he, and bg(u) are used. The real log canonical 
threshold Ag and its multiplicity my are determined in each Uy. Then the 
real log canonical threshold and its multiplicity of K(w) and y(w) are defined 
by 


A min{A; ; ¢}, 


m = max{me; =A}. 


By the definition, A is a positive real number and m is a positive integer 
which is not larger than d. 


Remark 46. Let us explain several points about the Hironaka theorem.

(1) In this book, knowledge of algebraic geometry is not necessary. However,
a reader who is not a mathematician may consult an introductory book for
nonmathematicians [82], in which many statistical models and learning
machines are studied. The book [69] is written for mathematicians who are not
majoring in algebraic geometry. The book [51] is a basic and famous text in
algebraic geometry. From a computational point of view, [14] is recommended
for students. The resolution of singularities was proved in [37] and has been
studied as one of the main themes in algebraic geometry [42]. The mathematical
relation to algebraic analysis is introduced in [9] and [41].

(2) The number of elements of $\{W_\ell\}$, which is equal to that of $\{U_\ell\}$, is finite.

(3) Since $k \ge 0$ and $h \ge 0$ are multi-indices, eq.(6.1) and eq.(6.2) can be
written more explicitly as

\[ K(g(u)) = (u_1)^{2k_1} (u_2)^{2k_2} \cdots (u_d)^{2k_d}, \]
\[ |g'(u)| = b(u)\,\bigl|(u_1)^{h_1} (u_2)^{h_2} \cdots (u_d)^{h_d}\bigr|. \]

Note that $|g'(u)| = 0$ if and only if $u^h = 0$, and the map $w = g(u)$ is
one-to-one if and only if $|u^h| > 0$. Such a function $w = g(u)$ is called a
birational map.

(4) In this book, manifold theory is not required, but if a reader has already
studied manifold theory, then $\bigcup_\ell W_\ell$ and $\bigcup_\ell U_\ell$ are compact
subsets of manifolds and $g(u) = \{g_\ell(u)\}$ is a map from a manifold to a
manifold. In other words, $g_\ell(u)$ can be understood as the restriction to a
local coordinate of a map from one manifold to another.

(5) In general, for a given function $K(w)$, neither $\{U_\ell\}$ nor $w = g(u)$ is
unique. Neither $k$ nor $h$ is unique. However, the real log canonical threshold
$\lambda$ and its multiplicity $m$ are uniquely determined and do not depend
on the choice of $w = g(u)$. Such values are called birational invariants.
The real log canonical threshold is an important birational invariant in
high-dimensional algebraic geometry, and it is also important in Bayesian
statistics. All Bayesian observables are automatically birational invariants,
because they do not depend on the choice of parameter representation.

(6) If $K(g(u)) = K_0(u)\,u^{2k}$ for some $K_0(u) > 0$ and $k_1 > 0$, then by

\[ u'_1 = K_0(u)^{1/(2k_1)}\, u_1, \]
\[ u'_2 = u_2, \]
\[ \vdots \]
\[ u'_d = u_d, \]

eq.(6.1) and eq.(6.2) are satisfied with respect to $u'$, because $\epsilon > 0$
can be taken sufficiently small.

(7) This theorem is called Hironaka's resolution of singularities, which is
a fundamental theorem in algebraic geometry. It holds for an arbitrary
analytic function $K(w)$. Note that this theorem can be employed even if
one does not know the definition of singularities. If $K(w)$ is an analytic
function, then it holds even if $K(w) = 0$ does not contain singularities.

(8) Even if $K(w)$ is not an analytic function, there are several cases to
which this theorem can be applied. For example, if $K(w)$ is a piecewise
analytic function, then this theorem can be applied on each piece of the
parameter set. If $K(w) = K_0(w)K_1(w)$, where $K_0(w) > 0$ may not be analytic
and $K_1(w)$ is analytic, then the resolution theorem can be applied to $K_1(w)$
and the same statistical theory can be derived.

(9) If a prior has a hyperparameter, then $\{\lambda_\ell\}$ and $\{m_\ell\}$ are
functions of the hyperparameter. In general they may be discontinuous or
nondifferentiable, which is the main reason for the phase transition of the
posterior distribution. See Section 9.4.

(10) For a given function $K(w)$, the Hironaka theorem gives an algebraic
algorithm by which the resolution map can be found. However, in general,
it is not easy to find the resolution map $w = g(u)$. For several statistical
models, the complete resolution maps have been found. For others, partial
resolution maps have been found, by which upper bounds of the real log
canonical thresholds were derived. Even if the complete resolution map cannot
be found, its existence enables us to prove universal formulas in Bayesian
statistics. For example, we have methods by which the generalization loss
and the minus log marginal likelihood are estimated without information
about the resolution map.

Example 45. (Resolution by projective space) Let $x = (y,z)$, $w = (s,t) \in \mathbb{R}^2$
and

\[ W = \{w = (s,t)\,;\ 0 \le s, t \le 1\}. \]

A statistical model and a prior are

\[ p(y,z|s,t) = \frac{1}{\pi}\exp\bigl(-(y-s)^2 - (z-t)^2\bigr), \]
\[ \varphi(s,t) = 1. \]

Assume that $q(x) = p(x|0,0)$. Then the optimal parameter that minimizes
the average log loss function is $w_0 = (0,0)$. The log density ratio function
and its average are respectively equal to

\[ f(x,w) = s^2 + t^2 - 2ys - 2zt, \]
\[ K(w) = s^2 + t^2. \]

This is not a standard form. In Example 26, we showed that it can be made
a standard form by using a polar coordinate system $(r,\theta)$. Here we give
another transform. Let us divide the parameter set as

\[ W = W_1 \cup W_2, \]
\[ W_1 = \{(s,t) \in W\,;\ s \ge t\}, \]
\[ W_2 = \{(s,t) \in W\,;\ s \le t\}. \]

Then we prepare two other parameter sets,

\[ U_1 = \{(s_1,t_1)\,;\ 0 \le s_1, t_1 \le 1\}, \]
\[ U_2 = \{(s_2,t_2)\,;\ 0 \le s_2, t_2 \le 1\}, \]

and two maps, one on each set,

\[ s = s_1 = s_2 t_2, \]
\[ t = s_1 t_1 = t_2. \]

The log density ratio function and its average are respectively given by

\[ f(y,z,s,t) = s_1\,(s_1 + s_1 t_1^2 - 2y - 2z t_1), \]
\[ K(s,t) = s_1^2\,(1 + t_1^2), \]

and

\[ f(y,z,s,t) = t_2\,(s_2^2 t_2 + t_2 - 2y s_2 - 2z), \]
\[ K(s,t) = t_2^2\,(s_2^2 + 1), \]

both of which are standard forms. The set $U_1 \cup U_2$ is called a projective
space. The integration of an arbitrary function $f(s,t)$ can be divided,

\[ \int_W f(s,t)\,ds\,dt = \int_{W_1} f(s,t)\,ds\,dt + \int_{W_2} f(s,t)\,ds\,dt \]
\[ \qquad = \int_{U_1} f(s_1, s_1 t_1)\, s_1\, ds_1\, dt_1 + \int_{U_2} f(s_2 t_2, t_2)\, t_2\, ds_2\, dt_2, \]

where we used the fact that the absolute values of the Jacobian determinants
of the two maps are $s_1$ and $t_2$ respectively. Therefore, we can apply the
standard theory to this case, and the real log canonical threshold and its
multiplicity are $\lambda = 1$ and $m = 1$.
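
The division of the integral in Example 45 can be checked numerically. The following is a minimal sketch, not part of the original text: the integrand `f` (here exp(-nK) with n = 5), the grid size, and the helper name `midpoint_2d` are illustrative assumptions. It compares the direct integral over W with the sum of the two chart integrals, each weighted by its Jacobian determinant.

```python
import math

def midpoint_2d(func, n=400):
    """Midpoint rule for the integral of func over the unit square [0,1]^2."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        for j in range(n):
            y = (j + 0.5) * h
            total += func(x, y)
    return total * h * h

def f(s, t):
    # Illustrative integrand (an assumption): exp(-n K(w)) with K = s^2 + t^2, n = 5.
    return math.exp(-5.0 * (s * s + t * t))

# Direct integral over W = [0,1]^2.
direct = midpoint_2d(f)

# Chart 1: (s, t) = (s1, s1*t1), Jacobian s1, covers W1 = {s >= t}.
chart1 = midpoint_2d(lambda s1, t1: f(s1, s1 * t1) * s1)

# Chart 2: (s, t) = (s2*t2, t2), Jacobian t2, covers W2 = {s <= t}.
chart2 = midpoint_2d(lambda s2, t2: f(s2 * t2, t2) * t2)

print(direct, chart1 + chart2)
```

The two printed numbers should agree up to the discretization error of the midpoint rule, which is what the change of variables predicts.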


Remark 47. (1) By the same method as in Example 45, a regular posterior
distribution can be made standard. Hence the regular statistical theory,
which requires the positive definiteness of the Fisher information matrix,
is a very special case of the general theory. In the general theory, the Fisher
information matrix may have zero eigenvalues and the transform from
one parameter set to another may not be a diffeomorphism. It
is well known that Bayesian estimation gives better estimation than the
maximum likelihood method when the Fisher information matrix is
degenerate. Such a fact can be mathematically proved by using algebraic
geometry.

(2) If a statistical model has a relatively finite variance, there exists $c_0 > 0$
such that

\[ \int q(x) f(x,w)^2\,dx \le c_0 \int q(x) f(x,w)\,dx = c_0 K(w). \]

Hence if both $f(x,w)$ and $K(w)$ are analytic functions of $w$ and if

\[ K(w) = w^{2k}, \]

then there exists a function $a(x,w)$ such that

\[ f(x,w) = w^{k} a(x,w). \]

Therefore, the form of the log density ratio function is automatically derived
from its average function.

(3) If $\varphi(w) = 0$ on the set $\{w\,;\ K(w) = 0\}$, then by applying the
resolution theorem to $K(w)\varphi(w)$, there exists a function $w = g(u)$ such that

\[ K(g(u)) = u^{2k}, \]
\[ \varphi(g(u))\,|g'(u)| = |u^{h}|\,b(u), \]

where $b(u) > 0$. Such a method is called simultaneous resolution of
singularities in algebraic geometry. Algebraic geometry is known as one of the
most abstract fields of mathematics; however, its concrete version gives the
essential basis of Bayesian statistics.


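Example 46. Let $(x,y) \in \mathbb{R}^2$, $w = (a,b,c) \in \mathbb{R}^3$, and

\[ W = \{w = (a,b,c)\,;\ |a|, |b|, |c| \le 1\}. \]

A statistical model and a prior are

\[ q(x) = \begin{cases} 1/2 & (|x| \le 1) \\ 0 & (\text{otherwise}), \end{cases} \]
\[ p(x,y|a,b,c) = \frac{q(x)}{\sqrt{2\pi}} \exp\Bigl(-\frac{1}{2}\bigl(y - aS(bx) - cx\bigr)^2\Bigr), \tag{6.4} \]
\[ \varphi(a,b,c) = 1/8, \tag{6.5} \]

where $S(x) = x + x^2$. If a true distribution is $p(x,y|0,0,0)$, then the log
density ratio function and its average are

\[ f(x,y|a,b,c) = \frac{1}{2}\bigl\{(aS(bx) + cx)^2 - 2y\,(aS(bx) + cx)\bigr\}, \]
\[ K(a,b,c) = \frac{1}{6}(ab + c)^2 + \frac{1}{10}a^2 b^4. \]

Let us divide the parameter set by

\[ W_1 = \{|a| \le |c|\}, \]
\[ W_2 = \{|a| \ge |c|,\ |ab| \le |ab + c|\}, \]
\[ W_3 = \{|a| \ge |c|,\ |ab + c| \le |ab^2|\}, \]
\[ W_4 = \{|a| \ge |c|,\ |ab^2| \le |ab + c| \le |ab|\}. \]

Then $W = \bigcup_j W_j$. The function $w = g(u)$ is defined by

\[ a = a_1 c_1, \quad b = b_1, \quad c = c_1, \]
\[ a = a_2, \quad b = b_2 c_2, \quad c = a_2 (1 - b_2) c_2, \]
\[ a = a_3, \quad b = b_3, \quad c = a_3 b_3 (b_3 c_3 - 1), \]
\[ a = a_4, \quad b = b_4 c_4, \quad c = a_4 b_4 c_4 (c_4 - 1). \]

The corresponding $U_j = \{(a_j, b_j, c_j)\}$ $(j = 1,2,3,4)$ are defined by

\[ U_j = \overline{g^{-1}((W_j)^o)}, \]

where $(W_j)^o$ is the maximum open set that is contained in $W_j$ and
$\overline{g^{-1}((W_j)^o)}$ is the minimum closed set that contains $g^{-1}((W_j)^o)$.
Note that the function $w = g(u)$ is one-to-one on $(W_j)^o$, hence $g^{-1}$ is
well-defined on such a set. The average log density ratio function is given by

\[ K(a,b,c) = c_1^2 \Bigl(\frac{1}{6}(a_1 b_1 + 1)^2 + \frac{1}{10} a_1^2 b_1^4\Bigr) \]
\[ \qquad = a_2^2 c_2^2 \Bigl(\frac{1}{6} + \frac{1}{10} b_2^4 c_2^2\Bigr) \]
\[ \qquad = a_3^2 b_3^4 \Bigl(\frac{1}{6} c_3^2 + \frac{1}{10}\Bigr) \]
\[ \qquad = a_4^2 b_4^2 c_4^4 \Bigl(\frac{1}{6} + \frac{1}{10} b_4^2\Bigr). \]

Hence $K(a,b,c)$ is normal crossing in each local coordinate. The absolute
value of the determinant of the Jacobian matrix is

\[ |g'(u)| = |c_1| \]
\[ \qquad = |a_2 c_2| \]
\[ \qquad = |a_3 b_3^2| \]
\[ \qquad = |a_4 b_4 c_4^2|. \]

The real log canonical threshold $\lambda_j$ and its multiplicity $m_j$ for each $U_j$ are

\[ \lambda_1 = 1, \quad m_1 = 1, \]
\[ \lambda_2 = 1, \quad m_2 = 2, \]
\[ \lambda_3 = 3/4, \quad m_3 = 1, \]
\[ \lambda_4 = 3/4, \quad m_4 = 1. \]

Hence the real log canonical threshold and its multiplicity of $(K(w), \varphi(w))$
are $\lambda = 3/4$ and $m = 1$ respectively.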
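
The four local expressions of K in Example 46 can be verified numerically. The following is a minimal sketch, not taken from the original text: it evaluates K(a,b,c) = (ab+c)^2/6 + a^2 b^4/10 at the image of random local points under each map and compares it with the factorized local form. The sampling range, the seed, and all names are illustrative assumptions.

```python
import random

def K(a, b, c):
    """Average log density ratio function of Example 46."""
    return (a * b + c) ** 2 / 6 + (a * b * b) ** 2 / 10

# Each entry: (map g from local coordinates to (a, b, c), factorized local form of K).
charts = [
    (lambda a, b, c: (a * c, b, c),
     lambda a, b, c: c**2 * ((a * b + 1) ** 2 / 6 + a**2 * b**4 / 10)),
    (lambda a, b, c: (a, b * c, a * (1 - b) * c),
     lambda a, b, c: a**2 * c**2 * (1 / 6 + b**4 * c**2 / 10)),
    (lambda a, b, c: (a, b, a * b * (b * c - 1)),
     lambda a, b, c: a**2 * b**4 * (c**2 / 6 + 1 / 10)),
    (lambda a, b, c: (a, b * c, a * b * c * (c - 1)),
     lambda a, b, c: a**2 * b**2 * c**4 * (1 / 6 + b**2 / 10)),
]

random.seed(0)
max_err = 0.0
for g, K_local in charts:
    for _ in range(1000):
        u = [random.uniform(-1.0, 1.0) for _ in range(3)]
        # K composed with the map must equal the factorized local expression.
        max_err = max(max_err, abs(K(*g(*u)) - K_local(*u)))

print("largest discrepancy:", max_err)
```

A discrepancy at the level of floating-point rounding indicates that each map does bring K into the stated normal crossing form.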


Remark 48. For a given average log density ratio function and a prior, there
exists a recursive and algebraic algorithm by which an analytic map
$w = g(u)$ is found in a finite number of steps.


Example 47. By using the example above, we investigate the phase transition
of Bayesian statistics. Instead of the prior given in eq.(6.5), let us study the
prior which has a hyperparameter $\alpha > 0$,

\[ \varphi(a,b,c|\alpha) = \frac{\alpha}{8}\,|a|^{\alpha-1}. \]

Then in each $U_j$,

\[ \varphi(g(u))\,|g'(u)| = (\alpha/8)\,|a_1^{\alpha-1} c_1^{\alpha}| \]
\[ \qquad = (\alpha/8)\,|a_2^{\alpha} c_2| \]
\[ \qquad = (\alpha/8)\,|a_3^{\alpha} b_3^2| \]
\[ \qquad = (\alpha/8)\,|a_4^{\alpha} b_4 c_4^2|. \]

Therefore, the real log canonical threshold $\lambda_j$ for each $U_j$ is

\[ \lambda_1 = (\alpha + 1)/2, \]
\[ \lambda_2 = \min\{(\alpha + 1)/2,\ 1\}, \]
\[ \lambda_3 = \min\{(\alpha + 1)/2,\ 3/4\}, \]
\[ \lambda_4 = \min\{(\alpha + 1)/2,\ 3/4\}. \]

If $\alpha \ne 1/2$ and $\alpha \ne 1$, then $m_j = 1$. Therefore

\[ \lambda = \begin{cases} (\alpha + 1)/2 & (\alpha \le 1/2) \\ 3/4 & (\alpha > 1/2). \end{cases} \]

If $\alpha = 1/2$, then $m = 2$, otherwise $m = 1$. Note that $\lambda$ is a continuous
function of $\alpha$, but is not differentiable at $\alpha = 1/2$. The posterior
distributions for $\alpha > 1/2$ and $\alpha < 1/2$ are quite different from each
other. The hyperparameter $\alpha = 1/2$ is called the critical point of the phase
transition. Because the real log canonical threshold determines the asymptotic
properties of the free energy and the generalization loss, hyperparameter
control is important for nonregular statistical models.
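
The dependence of the real log canonical threshold on the hyperparameter can be tabulated mechanically. The sketch below is an illustration, not the author's code. It assumes the standard-form rule from the previous chapter, namely that for K = u^{2k} and a weight |u^h| b(u) the local threshold is the minimum of (h_j + 1)/(2 k_j) over the coordinates with k_j > 0, and the multiplicity is the number of coordinates attaining that minimum. The exponents in `charts` are read off from the local forms in Examples 46 and 47; the function names are hypothetical.

```python
from fractions import Fraction

def local_rlct(k, h):
    """Local threshold and multiplicity for K = u^{2k}, weight |u^h| b(u), b > 0."""
    ratios = [(hj + 1) / Fraction(2 * kj) for kj, hj in zip(k, h) if kj > 0]
    lam = min(ratios)
    return lam, ratios.count(lam)

def rlct(alpha):
    """Global (lambda, m) for the prior (alpha/8)|a|^(alpha-1) of Example 47."""
    a = Fraction(alpha)
    charts = [
        ((0, 0, 1), (a - 1, 0, a)),  # U1: K ~ c1^2,            weight |a1^(alpha-1) c1^alpha|
        ((1, 0, 1), (a, 0, 1)),      # U2: K ~ a2^2 c2^2,       weight |a2^alpha c2|
        ((1, 2, 0), (a, 2, 0)),      # U3: K ~ a3^2 b3^4,       weight |a3^alpha b3^2|
        ((1, 1, 2), (a, 1, 2)),      # U4: K ~ a4^2 b4^2 c4^4,  weight |a4^alpha b4 c4^2|
    ]
    local = [local_rlct(k, h) for k, h in charts]
    lam = min(l for l, _ in local)
    m = max(mj for l, mj in local if l == lam)
    return lam, m

for alpha in (Fraction(1, 4), Fraction(1, 2), Fraction(1), Fraction(2)):
    print(alpha, rlct(alpha))
```

Setting alpha = 1 recovers the uniform prior of eq.(6.5), and alpha = 1/2 is the only value at which the multiplicity jumps to 2.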


6.3 General Asymptotic Theory


By using the resolution theorem, in the parameter set $\{w \in W\,;\ K(w) < \epsilon\}$,
the pair of a statistical model and a prior can be made a standard form. Here
$\epsilon > 0$ is the sufficiently small constant in the resolution theorem. The
normalized partition function is represented by the sum

\[ Z_n^{(0)} = Z_n^{(1)} + Z_n^{(2)}, \tag{6.6} \]

where, by using the constant $\epsilon > 0$,

\[ Z_n^{(1)} = \int_{K(w) < \epsilon} \exp(-nK_n(w))\,\varphi(w)\,dw, \]
\[ Z_n^{(2)} = \int_{K(w) \ge \epsilon} \exp(-nK_n(w))\,\varphi(w)\,dw. \]

Here $Z_n^{(1)}$ and $Z_n^{(2)}$ are the integrations over parameters in a
neighborhood of the optimal parameter set $W_0$ and outside it, respectively.

The nonessential part $Z_n^{(2)}$ can be bounded by the following procedure.
By the same definition as used in Section 4.1,

\[ \eta_n(w) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{K(w) - f(X_i,w)}{\sqrt{K(w)}} \]

and

\[ \|\eta_n\| = \sup_{w \in W \setminus W_0} |\eta_n(w)|, \]

the following inequality is derived,

\[ nK_n(w) = nK(w) - \sqrt{nK(w)}\,\eta_n(w) \]
\[ \qquad \ge nK(w)/2 - \|\eta_n\|^2/2. \]

Therefore

\[ Z_n^{(2)} \le \exp(\|\eta_n\|^2/2) \int_{K(w) \ge \epsilon} \exp(-nK(w)/2)\,\varphi(w)\,dw \]
\[ \qquad \le \exp(-n\epsilon/2 + \|\eta_n\|^2/2). \]

Hence $Z_n^{(2)} = o_p(\exp(-n\epsilon/3))$. Let us study the essential part
$Z_n^{(1)}$. By the resolution theorem, there exists $w = g(u)$ such that

\[ \{w\,;\ K(w) < \epsilon\} \subset \bigcup_j g(U_j), \]

where, in each $U_j$,

\[ K(g(u)) = u^{2k_j}, \]
\[ \varphi(g(u))\,|g'(u)| = b_j(u)\,|u^{h_j}|, \]

where the multi-indices $k_j$ and $h_j$ and the function $b_j(u)$ depend on $U_j$.
In each $U_j$, the local real log canonical threshold $\lambda_j$, its multiplicity
$m_j$, and the local redundant index $\mu_j$ are determined by the same method as
used in the previous chapter. Then $D_j(u)$ is defined in the same manner as
eq.(5.27),


1 és 
D;(u) = em) ela |b; (u)x(u). 


The normalized posterior function on $U_j$ is

\[ \varphi(g(u))\,|g'(u)|\, \frac{\prod_{i=1}^n p(X_i|g(u))}{\prod_{i=1}^n p(X_i|w_0)}
   = \exp(-nK_n(g(u)))\, b_j(u)\,|u^{h_j}| \]
\[ \qquad = \frac{(\log n)^{m_j-1}}{n^{\lambda_j}}\, D_j(u) \int_0^\infty dt\ t^{\lambda_j-1} \exp\bigl(-t + \sqrt{t}\,\xi_n(u)\bigr)
   + o_p\Bigl(\frac{(\log n)^{m_j-1}}{n^{\lambda_j}}\Bigr). \]

This equation shows the asymptotic form of the posterior distribution. When
$n$ tends to infinity, only the sets $\{U_j\}$ that maximize

\[ \frac{(\log n)^{m_j-1}}{n^{\lambda_j}} \]

affect the asymptotic form, and they are called essential local coordinates. In
other words, the ratio of $\Omega(g(u))$ on a set $U_j$ whose
$(\log n)^{m_j-1}/n^{\lambda_j}$ is smaller than that of the essential local
coordinates goes to zero in probability. By Definition 23, a set $U_j$ is an
essential local coordinate if and only if $\lambda_j = \lambda$ and $m_j = m$. Let
ELC be the set of all suffixes $j$ such that $U_j$ is an essential local
coordinate. Then the general theory can be established by the same procedure
as the standard theory with the replacement

\[ D(u) \mapsto \sum_{j \in \mathrm{ELC}} D_j(u). \]

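Let us summarize the general asymptotic theory. The asymptotic free energy
is

\[ F_n = nL_n(w_0) + \lambda \log n - (m-1) \log\log n \]
\[ \qquad - \log\Bigl( \sum_{j \in \mathrm{ELC}} \int du\, D_j(u) \int_0^\infty dt\ t^{\lambda-1} \exp\bigl(-t + \sqrt{t}\,\xi_n(u)\bigr) \Bigr) + o_p(1). \]

The renormalized posterior distribution is defined for an arbitrary function
$z(t,u)$ by

\[ \langle z(t,u) \rangle =
   \frac{\displaystyle \sum_{j \in \mathrm{ELC}} \int du\, D_j(u) \int_0^\infty dt\ z(t,u)\, t^{\lambda-1} \exp\bigl(-t + \sqrt{t}\,\xi_n(u)\bigr)}
        {\displaystyle \sum_{j \in \mathrm{ELC}} \int du\, D_j(u) \int_0^\infty dt\ t^{\lambda-1} \exp\bigl(-t + \sqrt{t}\,\xi_n(u)\bigr)}. \]

Then the renormalized posterior distribution satisfies

\[ \langle t \rangle = \lambda + \frac{1}{2} \langle \sqrt{t}\,\xi_n(u) \rangle. \]

The scaling law which connects the original posterior of $w$ and the
renormalized one of $u$ is given, for an arbitrary positive integer $s$, by

\[ E_w[f(x,w)^s] = \frac{1}{n^{s/2}} \bigl\langle (\sqrt{t}\,a(x,u))^s \bigr\rangle + o_p\Bigl(\frac{1}{n^{s/2}}\Bigr), \]
\[ E_w[K(w)] = \frac{1}{n} \langle t \rangle + o_p\Bigl(\frac{1}{n}\Bigr). \]

By using the renormalized posterior distribution, $\mathrm{Fluc}(\xi_n)$ is defined by

\[ \mathrm{Fluc}(\xi_n) = E_X\bigl[ \langle t\,a(X,u)^2 \rangle - \langle \sqrt{t}\,a(X,u) \rangle^2 \bigr]. \]

Then the singular fluctuation is defined by

\[ 2\nu = E_\xi[\langle \sqrt{t}\,\xi(u) \rangle] = E_\xi[\mathrm{Fluc}(\xi)]. \]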


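Asymptotic behaviors of the generalization loss, training loss, cross
validation loss, and WAIC are given by

\[ G_n = L(w_0) + \frac{1}{n}\Bigl(\lambda + \frac{1}{2}\langle \sqrt{t}\,\xi_n(u) \rangle - \frac{1}{2}\mathrm{Fluc}(\xi_n)\Bigr) + o_p\Bigl(\frac{1}{n}\Bigr), \tag{6.7} \]
\[ T_n = L_n(w_0) + \frac{1}{n}\Bigl(\lambda - \frac{1}{2}\langle \sqrt{t}\,\xi_n(u) \rangle - \frac{1}{2}\mathrm{Fluc}(\xi_n)\Bigr) + o_p\Bigl(\frac{1}{n}\Bigr), \tag{6.8} \]
\[ C_n = L_n(w_0) + \frac{1}{n}\Bigl(\lambda - \frac{1}{2}\langle \sqrt{t}\,\xi_n(u) \rangle + \frac{1}{2}\mathrm{Fluc}(\xi_n)\Bigr) + o_p\Bigl(\frac{1}{n}\Bigr), \tag{6.9} \]
\[ W_n = L_n(w_0) + \frac{1}{n}\Bigl(\lambda - \frac{1}{2}\langle \sqrt{t}\,\xi_n(u) \rangle + \frac{1}{2}\mathrm{Fluc}(\xi_n)\Bigr) + o_p\Bigl(\frac{1}{n}\Bigr), \tag{6.10} \]

respectively, whose expectations are

\[ E[G_n] = L(w_0) + \frac{\lambda}{n} + o\Bigl(\frac{1}{n}\Bigr), \]
\[ E[T_n] = L(w_0) + \frac{1}{n}(\lambda - 2\nu) + o\Bigl(\frac{1}{n}\Bigr), \]
\[ E[C_n] = L(w_0) + \frac{\lambda}{n} + o\Bigl(\frac{1}{n}\Bigr), \]
\[ E[W_n] = L(w_0) + \frac{\lambda}{n} + o\Bigl(\frac{1}{n}\Bigr), \]

respectively. Note that the following convergences in probability hold,

\[ n(G_n - L(w_0)) + n(C_n - L_n(w_0)) \to 2\lambda, \]
\[ n(G_n - L(w_0)) + n(W_n - L_n(w_0)) \to 2\lambda. \]

In other words, the generalization error is inversely correlated with the cross
validation error and with the WAIC error. The functional variance is defined by

\[ V_n = \frac{1}{n} \sum_{i=1}^n \bigl\{ E_w[(\log p(X_i|w))^2] - E_w[\log p(X_i|w)]^2 \bigr\}. \]

Then $nE[V_n] \to 2\nu$. The universal laws of Bayesian statistics hold,

\[ E[G_n] = E[C_n] + o\Bigl(\frac{1}{n}\Bigr), \tag{6.11} \]
\[ E[G_n] = E[W_n] + o\Bigl(\frac{1}{n}\Bigr). \tag{6.12} \]

If a sample is not independent, eq.(6.11) does not hold in general, whereas
eq.(6.12) holds in the cases discussed in Section 5.5.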
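
As a consistency check on eqs.(6.7), (6.9), and the stated expectations, the following worked step may help. It is a sketch that uses only the asymptotic expansions stated in this section together with the definition of the singular fluctuation, under the assumption that the expectations of $\langle \sqrt{t}\,\xi_n(u) \rangle$ and $\mathrm{Fluc}(\xi_n)$ both converge to $2\nu$.

```latex
\begin{align*}
n(G_n - L(w_0)) + n(C_n - L_n(w_0))
  &= \Bigl(\lambda + \tfrac{1}{2}\langle \sqrt{t}\,\xi_n(u) \rangle - \tfrac{1}{2}\mathrm{Fluc}(\xi_n)\Bigr)
   + \Bigl(\lambda - \tfrac{1}{2}\langle \sqrt{t}\,\xi_n(u) \rangle + \tfrac{1}{2}\mathrm{Fluc}(\xi_n)\Bigr) + o_p(1) \\
  &= 2\lambda + o_p(1), \\[4pt]
n\,(E[G_n] - L(w_0)) &\to \lambda + \tfrac{1}{2}(2\nu) - \tfrac{1}{2}(2\nu) = \lambda, \\
n\,(E[T_n] - L(w_0)) &\to \lambda - \tfrac{1}{2}(2\nu) - \tfrac{1}{2}(2\nu) = \lambda - 2\nu, \\
n\,(E[C_n] - L(w_0)) &\to \lambda - \tfrac{1}{2}(2\nu) + \tfrac{1}{2}(2\nu) = \lambda.
\end{align*}
```

The random terms cancel exactly in the sum, which is why a sample with a larger generalization error tends to show a smaller cross validation error.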


Remark 49. Let $W_0$ be the set of parameters which minimize the average log
density ratio function $K(w)$. The posterior distribution is the summation of
local distributions near $W_0$. However, when $n \to \infty$, the posterior
parameters are not distributed over all neighborhoods of $W_0$ but are restricted
to the essential local coordinates. Recall that the essential local coordinates
are characterized by the property that the local real log canonical threshold
is minimized. In other words, only the local parameters that have the
smallest local real log canonical thresholds are realized by the posterior
distribution if $n$ is sufficiently large. If a prior has a hyperparameter, then
the essential local coordinates are changed by controlling the hyperparameter
(phase transition), which affects the free energy and the generalization loss.
It is remarkable that the Bayesian posterior distribution automatically
minimizes the free energy and the generalization loss for a given
hyperparameter. This is the fundamental mathematical structure of Bayesian
statistics.


Figure 6.4: For a reduced rank regression model, the generalization, cross
validation, and WAIC errors, $G_n - S$, $C_n - S_n$, and $W_n - S_n$, are compared
(three panels, each with horizontal axis from 0 to 0.2). Averages are
asymptotically equal to $\lambda/n$. The resolution theorem gives the
mathematical prediction of the generalization error, whose average can be
estimated by the cross validation and WAIC errors.


Example 48. Let $M > 0$, $N > 0$, $H > 0$, and $H_0 \ge 0$ be integers. The
reduced rank regression is defined by the statistical model of $y \in \mathbb{R}^N$
for a given $x \in \mathbb{R}^M$,

\[ p(y|x,A,B) = \frac{1}{(2\pi)^{N/2}} \exp\Bigl(-\frac{1}{2}\|y - BAx\|^2\Bigr), \]

where $A = (A_{jk})$ and $B = (B_{k\ell})$ are $H \times M$ and $N \times H$ matrices
with real coefficients, respectively. A prior is set as

\[ \varphi(A,B) \propto \exp\Bigl(-\frac{1}{2}\Bigl(\sum_{j,k} (A_{jk})^2 + \sum_{k,\ell} (B_{k\ell})^2\Bigr)\Bigr). \]

Assume a true distribution $q(y|x) = p(y|x, A_0, B_0)$ which satisfies
$H_0 = \mathrm{rank}(B_0 A_0)$, $H_0 \le H$. The complete resolution map was given
by [5].

(1) If $N + H_0 \le M + H$, $M + H_0 \le N + H$, $H + H_0 \le M + N$, and
$M + N + H + H_0$ is an even integer, then $m = 1$ and

\[ \lambda = (1/8)\{2(H + H_0)(M + N) - (M - N)^2 - (H + H_0)^2\}. \]

(2) If $N + H_0 \le M + H$, $M + H_0 \le N + H$, $H + H_0 \le M + N$, and
$M + N + H + H_0$ is an odd integer, then $m = 2$ and

\[ \lambda = (1/8)\{2(H + H_0)(M + N) - (M - N)^2 - (H + H_0)^2 + 1\}. \]

(3) If $N + H_0 > M + H$, then $m = 1$ and

\[ \lambda = (1/2)\{HM - HH_0 + NH_0\}. \]

(4) If $M + H_0 > N + H$, then $m = 1$ and

\[ \lambda = (1/2)\{HN - HH_0 + MH_0\}. \]

(5) If $H + H_0 > M + N$, then $m = 1$ and

\[ \lambda = MN/2. \]

Note that if a prior satisfies $\varphi(A,B) > 0$, then $\lambda$ and $m$ are the
same as above. Figure 6.4 shows experimental results for the generalization,
cross validation, and WAIC errors for $M = N = H = 5$, $H_0 = 3$. Then, for
$n = 100$, $\lambda/n = 0.12$. The experimental averages and standard deviations
were

\[ G_n - S = 0.126,\ 0.035, \]
\[ C_n - S_n = 0.127,\ 0.034, \]
\[ W_n - S_n = 0.126,\ 0.034. \]

The sample size $n = 100$ does not seem to be sufficiently large; nevertheless,
the theoretical value coincides with the numerical results. From the
mathematical point of view, the general theory needs the condition that $n$ is
sufficiently large; however, it holds even for smaller $n$ in many concrete
statistical models. Hence the real log canonical threshold can be used for
checking whether or not the posterior distribution is accurately approximated
by Markov chain Monte Carlo.
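
The case analysis of Example 48 is easy to mis-evaluate by hand, so a small helper can be useful. The following is a minimal sketch that transcribes cases (1) to (5) as stated above; the function name `reduced_rank_rlct` and the argument order are illustrative assumptions, and the inputs are assumed to satisfy $H_0 \le H$.

```python
from fractions import Fraction

def reduced_rank_rlct(M, N, H, H0):
    """Return (lambda, m) for reduced rank regression, following cases (1)-(5)."""
    if N + H0 > M + H:                      # case (3)
        return Fraction(H * M - H * H0 + N * H0, 2), 1
    if M + H0 > N + H:                      # case (4)
        return Fraction(H * N - H * H0 + M * H0, 2), 1
    if H + H0 > M + N:                      # case (5)
        return Fraction(M * N, 2), 1
    base = 2 * (H + H0) * (M + N) - (M - N) ** 2 - (H + H0) ** 2
    if (M + N + H + H0) % 2 == 0:           # case (1): even sum
        return Fraction(base, 8), 1
    return Fraction(base + 1, 8), 2         # case (2): odd sum

# Setting of Figure 6.4: M = N = H = 5, H0 = 3, n = 100.
lam, m = reduced_rank_rlct(5, 5, 5, 3)
print(lam, m, float(lam) / 100)
```

For the setting of Figure 6.4 this gives the value of $\lambda/n$ quoted in the example, which can then be compared with the Monte Carlo averages.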


6.4 Maximum A Posteriori Method 


In this section we study the asymptotic property of the maximum a posteriori
method in nonregular cases. If a prior is defined by the uniform distribution,
then it is equivalent to the maximum likelihood method. In
general, their results are very different from those of Bayesian estimation,
because a single parameter is chosen. In statistical inference, a single
parameter is far from a distribution on the parameter set. For example, see
Example 60. The average and empirical log loss functions are

\[ L(w) = -\int q(x) \log p(x|w)\,dx, \]
\[ L_n(w) = -\frac{1}{n} \sum_{i=1}^n \log p(X_i|w), \]

where $w \in W$. In this section, we study the case that $W$ is a compact set.
Let $\hat{w}$ be the parameter that minimizes

\[ \mathcal{L}(w) = L_n(w) - \frac{1}{n} \log \varphi(w), \]

which is called the maximum a posteriori (MAP) estimator. If $\log \varphi(w)$ is
a constant for all $w$, then $\hat{w}$ is the maximum likelihood estimator (MLE).
The generalization and training losses are respectively defined by

\[ G_n(\mathrm{MAP}) = L(\hat{w}), \]
\[ T_n(\mathrm{MAP}) = L_n(\hat{w}). \]


The set of optimal parameters $W_0$ is defined by

\[ W_0 = \{w \in W\,;\ L(w) \text{ is minimized}\}. \]

We assume that the parameter set $W_0$ is compact and that the convergence in
probability holds,

\[ \min_{w_0 \in W_0} \|\hat{w} - w_0\| \to 0. \]

Hence we can restrict the parameter set to the union of the neighborhoods
of $W_0$. By using the resolution theorem in each local neighborhood $U_j$,

\[ nK_n(g(u)) = \sum_{i=1}^n \log \frac{p(X_i|w_0)}{p(X_i|g(u))} \]
\[ \qquad = n\,u^{2k} - \sqrt{n}\,u^{k}\,\xi_n(u), \]

where we can assume $u \in [0,1]^d$ without loss of generality and
$k = (k_1, k_2, \ldots, k_d)$. In this section, the integer $r$ $(1 \le r \le d)$ is
defined so that $k_1, k_2, \ldots, k_r > 0$ and

\[ u^{2k} = u_1^{2k_1} u_2^{2k_2} \cdots u_r^{2k_r}. \]

In other words, $\{k_1, \ldots, k_r\}$ does not contain zero. For a given $u$, let
$a$ $(1 \le a \le r)$ be the positive integer which satisfies

\[ \frac{u_a^2}{k_a} \le \frac{u_i^2}{k_i} \qquad (i = 1, 2, \ldots, r). \tag{6.13} \]

A map

\[ [0,1]^d \ni u \mapsto (t,v) = (t, (v_1, v_2, \ldots, v_d)) \in \mathbb{R}^1 \times [0,1]^d \]

is defined by

\[ t = u^{2k}, \]
\[ v_i = \begin{cases} \sqrt{u_i^2 - (k_i/k_a)\,u_a^2} & (1 \le i \le r) \\ u_i & (r < i \le d). \end{cases} \]

By the definition, $v_a = 0$, hence $v$ is contained in the set

\[ V = \{x = (x_1, x_2, \ldots, x_d) \in [0,1]^d\,;\ x_1 x_2 \cdots x_r = 0\}. \]

Moreover, the map

\[ [0,1]^d \ni u \mapsto (t,v) \in T \times V \]

is one-to-one.


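Example 49. Let us illustrate the case when $u^{2k} = u_1^2 u_2^4$. Then
$(k_1, k_2) = (1, 2)$,

\[ t = u_1^2 u_2^4, \]
\[ v_1 = \sqrt{u_1^2 - (1/2)\,u_2^2} \qquad (\text{if } u_2^2/2 \le u_1^2), \]
\[ v_2 = \sqrt{u_2^2 - 2u_1^2} \qquad (\text{otherwise}). \]

Figure 6.5 shows the coordinates $(t, v_1)$ and $(t, v_2)$. The coordinate
$(t, v_1)$ is used if $u_2^2/2 \le u_1^2$, whereas $(t, v_2)$ is used otherwise.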
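
The claim that the map from u to (t, v) is one-to-one can be explored numerically. The sketch below is an illustration under stated assumptions rather than the author's construction: it restricts to the coordinates with positive exponents, implements the forward map as defined above, and inverts it by bisection on $s = u_a^2$, using the fact that $t$ is increasing in $s$ for fixed $v$. The function names `to_tv` and `from_tv` are hypothetical.

```python
import math

def to_tv(u, k):
    """Forward map u -> (t, v) for coordinates with all k_i > 0."""
    r = len(k)
    # Index a satisfying eq.(6.13): u_a^2 / k_a is the smallest ratio.
    a = min(range(r), key=lambda i: u[i] ** 2 / k[i])
    t = math.prod(ui ** (2 * ki) for ui, ki in zip(u, k))
    v = [math.sqrt(max(0.0, u[i] ** 2 - (k[i] / k[a]) * u[a] ** 2)) for i in range(r)]
    return t, v

def from_tv(t, v, k, iters=200):
    """Inverse map by bisection on s = u_a^2, where a is the index with v_a = 0."""
    r = len(k)
    a = min(range(r), key=lambda i: v[i])

    def t_of(s):
        # t as a function of s: u_i^2 = v_i^2 + (k_i / k_a) s.
        return math.prod((v[i] ** 2 + (k[i] / k[a]) * s) ** k[i] for i in range(r))

    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if t_of(mid) < t:
            lo = mid
        else:
            hi = mid
    s = 0.5 * (lo + hi)
    return [math.sqrt(v[i] ** 2 + (k[i] / k[a]) * s) for i in range(r)]

k = (1, 2)                       # the case u^{2k} = u_1^2 u_2^4 of Example 49
u = (0.8, 0.5)
t, v = to_tv(u, k)
u_back = from_tv(t, v, k)
print(t, v, u_back)
```

Recovering the original point from (t, v) is consistent with the one-to-one property; the same round trip works for points where the other coordinate $(t, v_2)$ is the active one.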

By using the coordinate $(t,v)$ and Theorem 8, $\mathcal{L}(g(u))$, $L(g(u))$, and
$L_n(g(u))$ are represented as functions of $(t,v)$,

\[ \mathcal{L}(t,v) = L_n(w_0) + t - \sqrt{t/n}\ \xi_n(t,v) - \frac{1}{n} \log \varphi(t,v), \tag{6.14} \]
\[ L(t,v) = L(w_0) + t, \tag{6.15} \]
\[ L_n(t,v) = L_n(w_0) + t - \sqrt{t/n}\ \xi_n(t,v). \tag{6.16} \]

By using these representations, we can derive the average and empirical log
losses. Before the theorem, we prepare a lemma.


Lemma 25. Let $f(u)$ be a $C^1$-class function of $u$. There exists $C > 0$ such
that

\[ |f(t,v) - f(0,v)| \le C\,t^{1/(2|k|)}\,\|\nabla f\| \qquad (0 \le t \le 1), \]

where $2|k| = 2(k_1 + \cdots + k_r)$ and

\[ \|\nabla f\| = \sup_{u \in [0,1]^d}\ \max_{1 \le j \le d} \Bigl|\frac{\partial f}{\partial u_j}(u)\Bigr|. \]


Figure 6.5: Coordinates for MAP and ML (panel title: "Coordinate u and
(t,v)"). The coordinate for analyzing the maximum a posteriori estimator in
the case $u^{2k} = u_1^2 u_2^4$. Coordinates are $t = u_1^2 u_2^4$,
$v_1 = (u_1^2 - u_2^2/2)^{1/2}$, and $v_2 = (u_2^2 - 2u_1^2)^{1/2}$. The coordinate
$(t, v_1)$ is used if $u_2^2/2 \le u_1^2$, whereas $(t, v_2)$ is used otherwise.


Proof. For $u = (t,v)$ and $u' = (0,v)$,

\[ |f(t,v) - f(0,v)| = |f(u) - f(u')| \le \sum_{j=1}^d |u_j - u'_j|\ \|\nabla f\| \]
\[ \qquad \le d\,\max_j |u_j - u'_j|\ \|\nabla f\| \]
\[ \qquad \le C\,t^{1/(2|k|)}\,\|\nabla f\|, \]

where the last inequality is proved as follows. If $j = a$ then
$|u_j - u'_j| = u_a$; if $j \ne a$ and $j \le r$ then $u_a^2/k_a \le u_j^2/k_j$, hence

\[ |u_j - u'_j| = \bigl|u_j - (u_j^2 - (k_j/k_a)\,u_a^2)^{1/2}\bigr|
   = \frac{(k_j/k_a)\,u_a^2}{u_j + (u_j^2 - (k_j/k_a)\,u_a^2)^{1/2}}
   \le \frac{(k_j/k_a)\,u_a^2}{u_j} \le \sqrt{k_j/k_a}\ u_a; \]

and if $j > r$ then $u_j = u'_j$. There exists $C' > 0$ such that

\[ (u_a)^{2|k|} \le C'\,u^{2k} = C'\,t, \]

which completes the lemma. □

The asymptotic behaviors of the average and empirical log loss functions for
the MAP method are derived by the following theorem. Let us define


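\[ \psi(u) = -\log \varphi(g(u)). \]

Let $u^*$ be the parameter defined by the minimization

\[ u^* = \arg\min_{u\,;\ u^{2k} = 0} \Bigl\{ -\frac{1}{4} \max\{0, \xi_n(u)\}^2 + \psi(u) \Bigr\}. \tag{6.17} \]

Then we obtain the following theorem.


Theorem 18. The average and empirical log loss functions of the MAP
method satisfy

\[ L(\hat{w}) = L(w_0) + \frac{1}{4n} \max\{0, \xi_n(u^*)\}^2 + o_p\Bigl(\frac{1}{n}\Bigr), \]
\[ L_n(\hat{w}) = L_n(w_0) - \frac{1}{4n} \max\{0, \xi_n(u^*)\}^2 + o_p\Bigl(\frac{1}{n}\Bigr). \]

Proof. By applying Lemma 25 to $\xi_n(t,v)$, if $t \to 0$, then

\[ \xi_n(t,v) - \xi_n(0,v) = o_p(1). \]

Eq.(6.14) is equal to

\[ \mathcal{L}(t,v) = L_n(w_0) + \Bigl(\sqrt{t} - \frac{\xi_n(t,v)}{2\sqrt{n}}\Bigr)^2 - \frac{\xi_n(t,v)^2}{4n} + \frac{1}{n}\psi(t,v). \tag{6.18} \]

Let $(\hat{t}, \hat{v})$ be the set of parameters that minimizes $\mathcal{L}(t,v)$.
Firstly we study the case $\xi_n(\hat{t},\hat{v}) \le 0$. Then there exists
$t^* = O_p(1)$ such that

\[ \sqrt{\hat{t}} = \frac{t^*}{\sqrt{n}}, \]

because, if this equation holds, the order of $\mathcal{L}(\hat{t},\hat{v}) - L_n(w_0)$
is not larger than $1/n$, and otherwise it is larger than $1/n$. Then

\[ \mathcal{L}(\hat{t},\hat{v}) = L_n(w_0) + \frac{1}{n}\bigl\{ (t^*)^2 - t^*\,\xi_n(0,\hat{v}) + \psi(0,\hat{v}) \bigr\} + o_p\Bigl(\frac{1}{n}\Bigr). \]

Since $\xi_n(\hat{t},\hat{v}) \le 0$ and $\hat{t} = O_p(1/n)$, $\xi_n(0,\hat{v}) \le o_p(1)$.
Therefore $\mathcal{L}(\hat{t},\hat{v})$ is minimized by $t^* = o_p(1)$ and

\[ \hat{v} = \arg\min_v \psi(0,v) + o_p(1). \]

Thus $\hat{t} = o_p(1/n)$ and, by eqs.(6.15) and (6.16),

\[ L(\hat{w}) = L(w_0) + o_p(1/n), \]
\[ L_n(\hat{w}) = L_n(w_0) + o_p(1/n). \]

Secondly, we study the case $\xi_n(\hat{t},\hat{v}) > 0$. There exists
$t^* = O_p(1)$ such that

\[ \sqrt{\hat{t}} = \frac{1}{2\sqrt{n}}\bigl(\xi_n(\hat{t},\hat{v}) + t^*\bigr)
   = \frac{1}{2\sqrt{n}}\bigl(\xi_n(0,\hat{v}) + t^*\bigr) + o_p(1/\sqrt{n}), \]

because, otherwise, the order of $\mathcal{L}(\hat{t},\hat{v})$ is larger than $1/n$
by eq.(6.18). Then also by eq.(6.18) and $\hat{t} = o_p(1)$,

\[ \mathcal{L}(\hat{t},\hat{v}) = \frac{(t^*)^2}{4n} - \frac{\xi_n(0,\hat{v})^2}{4n} + \frac{\psi(0,\hat{v})}{n} + L_n(w_0) + o_p(1/n). \tag{6.19} \]

Hence $t^* = o_p(1)$ and

\[ \hat{v} = \arg\min_v \bigl( -\xi_n(0,v)^2 + 4\psi(0,v) \bigr) + o_p(1). \]

Therefore,

\[ L(\hat{w}) = L(w_0) + \frac{1}{4n}\,\xi_n(0,\hat{v})^2 + o_p(1/n), \]
\[ L_n(\hat{w}) = L_n(w_0) - \frac{1}{4n}\,\xi_n(0,\hat{v})^2 + o_p(1/n). \]

By combining the above two cases, the theorem is obtained. □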
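
The key step of the proof is the one-dimensional minimization over t in eq.(6.16). The following is a minimal numerical sketch of that step only, under simplifying assumptions that are not in the original: $\xi_n$ is frozen at a fixed number `xi`, the prior term $\psi/n$ is ignored, and the grid bounds are arbitrary. It compares a brute-force grid search with the closed form $\hat{t} = \max\{0, \xi\}^2/(4n)$.

```python
import math

def profile(t, xi, n):
    """t - sqrt(t/n) * xi, the t-dependent part of eq.(6.16) with xi_n frozen at xi."""
    return t - math.sqrt(t / n) * xi

def grid_argmin(xi, n, t_max=0.05, steps=50000):
    """Brute-force minimization of profile over a uniform grid on [0, t_max]."""
    best_t, best_val = 0.0, profile(0.0, xi, n)
    for i in range(1, steps + 1):
        t = t_max * i / steps
        val = profile(t, xi, n)
        if val < best_val:
            best_t, best_val = t, val
    return best_t, best_val

n = 100
for xi in (1.5, -1.0):
    t_num, val_num = grid_argmin(xi, n)
    t_closed = max(0.0, xi) ** 2 / (4 * n)      # closed-form minimizer
    val_closed = -max(0.0, xi) ** 2 / (4 * n)   # minimum value of the profile
    print(xi, t_num, t_closed, val_num, val_closed)
```

For positive `xi` the minimum value is the negative of the minimizer, which mirrors the opposite signs of the $1/(4n)$ terms in $L(\hat{w})$ and $L_n(\hat{w})$; for non-positive `xi` the minimizer is t = 0.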


Theorem 19. There exists $\mu > 0$ such that

\[ E[L(\hat{w})] = L(w_0) + \frac{\mu}{n} + o\Bigl(\frac{1}{n}\Bigr), \]
\[ E[L_n(\hat{w})] = L(w_0) - \frac{\mu}{n} + o\Bigl(\frac{1}{n}\Bigr). \]

Proof. The parameter $u^*$ is defined by the minimization

\[ u^* = \arg\min_{u\,;\ u^{2k} = 0} \Bigl\{ -\frac{1}{4} \max\{0, \xi_n(u)\}^2 + \psi(u) \Bigr\}. \tag{6.20} \]

By using the convergence of the empirical process $\xi_n(u) \to \xi(u)$ and of
its average, shown in Sections 10.4 and 10.5, this theorem is obtained with

\[ \mu = \lim_{n \to \infty} E[\max\{0, \xi_n(u^*)\}^2]/4. \]

□


Figure 6.6: Trajectory of steepest descent (panel title: "Trajectory of the
Steepest Descent"). The contour of the square error of a simple neural network
and the trajectory of the steepest descent are shown. The parameter has two
representations $(a,b)$ and $(t,v)$, where $t = a^2 b^2$. Optimization with
respect to $t$ makes the generalization error smaller, whereas optimization
with respect to $v$ makes it larger.


Example 50. (Over-training) Let us study a simple neural network of
$x, y, a, b \in \mathbb{R}$,

\[ p(y|x,a,b) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Bigl(-\frac{1}{2\sigma^2}\bigl(y - a\tanh(bx)\bigr)^2\Bigr), \]

where $\sigma = 0.1$. The random variable $X$ is subject to the uniform distribution
on $[-2, 2]$ and the true conditional density is $q(y|x) = p(y|x,0,0)$. The
contour of the square error

\[ E(a,b) = \sum_{i=1}^n \bigl(Y_i - a\tanh(bX_i)\bigr)^2 \]

and the trajectory of the steepest descent are shown in Figure 6.6. In this
case $t = a^2 b^2$. The parameter can be represented by $(a,b)$ and $(t,v)$. In the
steepest descent, the parameter $t$ is rapidly optimized, whereas $v$ is searched
very slowly. Optimization with respect to $v$ causes over-training, that is to
say, the generalization error is made smaller by optimization with respect to
$t$, but larger by optimization with respect to $v$. If Bayesian estimation is
employed, then the posterior distribution is spread over $\{v\}$ and $t$ is
optimized, which makes the generalization error smaller. This is the main
difference between MAP or ML and Bayesian estimation.
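
The steepest descent of Example 50 can be reproduced in a few lines. The following is a minimal sketch, not the experiment behind Figure 6.6: the seed, the sample size, the initial point (1, 1), the step size 0.05, and the use of the mean instead of the sum in the square error (which only rescales the step size) are all assumptions made for illustration.

```python
import math
import random

random.seed(0)
n, sigma = 100, 0.1
X = [random.uniform(-2.0, 2.0) for _ in range(n)]
Y = [random.gauss(0.0, sigma) for _ in range(n)]   # true parameter is (a, b) = (0, 0)

def loss(a, b):
    """Mean square error of the network y = a * tanh(b * x)."""
    return sum((y - a * math.tanh(b * x)) ** 2 for x, y in zip(X, Y)) / n

def grad(a, b):
    """Gradient of the mean square error with respect to (a, b)."""
    ga = gb = 0.0
    for x, y in zip(X, Y):
        th = math.tanh(b * x)
        r = y - a * th
        ga += -2.0 * r * th
        gb += -2.0 * r * a * x * (1.0 - th * th)
    return ga / n, gb / n

a, b = 1.0, 1.0                      # assumed initial point
initial_loss = loss(a, b)
trajectory = [(a, b)]
for _ in range(2000):
    ga, gb = grad(a, b)
    a, b = a - 0.05 * ga, b - 0.05 * gb
    trajectory.append((a, b))

final_loss = loss(a, b)
print("t = a^2 b^2 at start and end:", 1.0, (a * b) ** 2)
print("loss at start and end:", initial_loss, final_loss)
```

Plotting `trajectory` over the contour of `loss` gives a picture of the same kind as Figure 6.6; tracking $t = a^2 b^2$ along the path shows how quickly that coordinate is optimized compared with the remaining direction.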
Remark 50. (1) The constant $\mu$ depends on the prior $\varphi(w)$. In the maximum
likelihood method, it is given by the average of the maximum value of the
Gaussian process. If $W$ is compact, it is finite but not small in general. If
$W$ is not compact, the maximum value of the Gaussian process is not finite
in general, hence the asymptotic properties of the above theorem do not
hold. The maximum likelihood method is not appropriate if the likelihood
function cannot be approximated by a normal distribution.

(2) In the above theorems, we study the maximum a posteriori (MAP)
method on the parameter $w \in W$. In order to study the MAP method on
$u \in \bigcup_j U_j$, $\varphi(g(u))$ should be replaced by $\varphi(g(u))\,|g'(u)|$.
Note that MAP depends on the transform of the parameter. In other words, MAP is
not invariant under a change of parameter representation, whereas ML and Bayes
are invariant.

(3) In general, the average parameter $E_w[w]$ does not converge to $W_0$ in
normal mixtures and neural networks. Hence

\[ L(E_w[w]) = nL(w_0) + nC + O_p(1). \]

In other words, the posterior mean estimator is not appropriate if the
posterior distribution cannot be approximated by a normal distribution.


6.5 Problems 


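1. For a given analytic function $K(w) \ge 0$ and a prior $\varphi(w)$, the partition
function, the state density function, and the zeta function are defined by

\[ Z(n) = \int \exp(-nK(w))\,\varphi(w)\,dw, \]
\[ v(t) = \int \delta(t - K(w))\,\varphi(w)\,dw, \]
\[ \zeta(z) = \int K(w)^z\,\varphi(w)\,dw. \]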


Show that the following are equivalent. 

(1) If n > oo, then log Z(n) + Alogn > 0. 

(2) If $t \to +0$, then $v(t)/t^{\lambda-1} \to c > 0$ for some $c$.

(3) $\zeta(z)$ is holomorphic in the region $\mathrm{Re}(z) > -\lambda$ and has a pole
at $z = -\lambda$ with the order 1.


2. Assume that two sets of analytic functions and priors $(K_1(w), \varphi_1(w))$
and $(K_2(w), \varphi_2(w))$ have the real log canonical thresholds $\lambda_1$ and
$\lambda_2$, respectively. Also assume that $K_1(w) \ge c_1 K_2(w)$ for some
$c_1 > 0$ and that $\varphi_1(w) \le c_2 \varphi_2(w)$ for some $c_2 > 0$. Then prove
that $\lambda_1 \ge \lambda_2$.


3. Assume that two sets of analytic functions and priors $(K_1(w_1), \varphi_1(w_1))$
and $(K_2(w_2), \varphi_2(w_2))$ have the real log canonical thresholds $\lambda_1$ and
$\lambda_2$, respectively, where $w_1$ and $w_2$ are different variables. Then
prove the following.
(1) The real log canonical threshold of $(K_1(w_1) + K_2(w_2), \varphi_1(w_1)\varphi_2(w_2))$
is $\lambda_1 + \lambda_2$.

(2) The real log canonical threshold of $(K_1(w_1)K_2(w_2), \varphi_1(w_1)\varphi_2(w_2))$
is $\min\{\lambda_1, \lambda_2\}$.


4. Assume that two sets of functions $\{f_j(w)\,;\ j = 1, 2, \ldots, J\}$ and
$\{g_k(w)\,;\ k = 1, 2, \ldots, K\}$ satisfy

\[ \Bigl\{ \sum_{j=1}^J a_j(w) f_j(w)\,;\ a_j(w) \in R \Bigr\}
   = \Bigl\{ \sum_{k=1}^K b_k(w) g_k(w)\,;\ b_k(w) \in R \Bigr\}, \tag{6.21} \]

where $R = \mathbb{R}[w_1, w_2, \ldots, w_d]$ is the polynomial ring generated by
$1, w_1, w_2, \ldots, w_d$ with real coefficients. Then prove that the following
two functions have the same real log canonical threshold if the same prior is
employed,

\[ \sum_{j=1}^J (f_j(w))^2, \qquad \sum_{k=1}^K (g_k(w))^2. \]

Note that if eq.(6.21) holds, it is said that the ideal generated by
$\{f_j(w)\}$ is equal to that of $\{g_k(w)\}$.


5. Let us define a function $F'_n$ by

\[ F'_n = -\log \int \exp(-nL(w))\,\varphi(w)\,dw, \tag{6.22} \]

where $L(w)$ is the average log loss function. Then prove that

\[ F'_n = nL(w_0) + \lambda \log n - (m-1) \log\log n + O(1). \]

Hence the difference between $F_n$ and $F'_n$ is a constant order random variable.


6. For a general random variable $X$, $E[\exp(X)] \ge \exp(E[X])$ holds, which is
called Jensen's inequality. Prove that, for general random variables $X$ and
$Y$ and a general function $f(X,Y)$,

\[ E_X\bigl[ -\log E_Y[\exp(f(X,Y))] \bigr] \le -\log E_Y\bigl[ \exp(E_X[f(X,Y)]) \bigr] \]

by using Jensen's inequality. By using this inequality, prove that

\[ E[F_n] \le F'_n, \]

where $F'_n$ is defined by eq.(6.22).


Chapter 7


Markov Chain Monte Carlo 


The Markov chain Monte Carlo method (MCMC) enables us to numerically
approximate the posterior average for an arbitrary statistical model and
prior. If a posterior distribution is spread over a single local parameter
region, then the MCMC approximation is accurate; otherwise it is still not so
easy. In many important statistical models such as a normal mixture or an
artificial neural network, Bayesian inference attains much more precise
estimation, hence it becomes more important to construct an MCMC algorithm
which works even for singular posterior distributions. In this chapter, we
introduce the basic foundations of the MCMC process.

(1) The Metropolis method is explained. The Hamiltonian Monte Carlo and 
the parallel tempering are its advanced versions. 

(2) The Gibbs sampler is introduced. Nonparametric Bayesian sampler is 
its advanced version. 

(3) Numerical approximation methods of the generalization loss and the free 
energy using MCMC method are explained. 

In order to check how accurate MCMC approximates the posterior distri- 
bution in singular cases, the real log canonical threshold would be a good 
index for a given set of a true distribution, a statistical model, and a prior. 


7.1 Metropolis Method 


Let $p(x|w)$ and $\varphi(w)$ be a statistical model and a prior, where
$x \in \mathbb{R}^N$, $w \in W \subset \mathbb{R}^d$. The Hamiltonian function
$H(w)$ is defined by

$$H(w) = -\sum_{i=1}^{n} \log p(X_i|w) - \log \varphi(w).$$

Then the posterior distribution is represented by

$$p(w) = \frac{1}{Z_n}\varphi(w)\prod_{i=1}^{n} p(X_i|w) = \frac{1}{Z_n}\exp(-H(w)),$$

where $Z_n$ is a partition function or the marginal likelihood,

$$Z_n = \int \exp(-H(w))\,dw.$$


The probability density function $p(w)$, which is equal to the posterior
distribution in Bayesian statistics, is called the equilibrium state of the
Hamiltonian $H(w)$. Our purpose in this chapter is to generate
$\{w_k\}_{k=1}^{K}$ such that, for an arbitrary function $f(w)$,

$$\int f(w)p(w)\,dw \approx \frac{1}{K}\sum_{k=1}^{K} f(w_k) \tag{7.1}$$

when $K \to \infty$. In most cases in statistical applications, $n$ is large,
hence the parameter set

$$\{w \in W ; p(w) > \epsilon\}$$

for some $\epsilon > 0$ is a very narrow subset of $W$, with the result that a
Riemann sum of the integral over the parameter space does not give an
effective approximation.


Remark 51. (1) (Curse of dimensionality) If $d = 1$, then the integral can be
approximated by a Riemann sum,

$$\int_0^1 f(w)p(w)\,dw \approx \frac{1}{K}\sum_{k=1}^{K} f(k/K)\,p(k/K), \tag{7.2}$$

which is more accurate than MCMC. However, if $d = 2, 3, 4, \ldots$, then the
number $K^d$ of points necessary for the approximation becomes too large to be
calculated numerically. This difficulty is called the "curse of
dimensionality".

(2) (Importance sampling) If a function $H_0(w)$ exists such that $\{w_k\}$ can
be easily generated from $p_0(w) \propto \exp(-H_0(w))$, then

$$\int f(w)p(w)\,dw \approx \frac{\sum_{k=1}^{K} f(w_k)\exp(-H(w_k) + H_0(w_k))}{\sum_{k=1}^{K} \exp(-H(w_k) + H_0(w_k))}. \tag{7.3}$$

This method is called importance sampling, which works well if
$H(w) \approx H_0(w)$.

(3) In almost all cases, $H(w)$ is given explicitly, whereas $Z_n$ is not. It
is more difficult to calculate $Z_n$ than to estimate the average.
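
The importance sampling estimate of Remark 51 (2) can be sketched as follows.
This is only an illustration under stated assumptions: the function name
`importance_average` and its arguments are hypothetical, the parameter is a
scalar, and the caller supplies a sampler for $p_0(w)$. The common shift of the
log weights is a numerical safeguard that cancels in the ratio.

```python
import math
import random


def importance_average(f, H, H0, sample_p0, K, rng):
    """Estimate the average of f under p(w) ∝ exp(-H(w)) by the ratio in Remark 51 (2).

    sample_p0(rng) must return one draw from p0(w) ∝ exp(-H0(w)).
    """
    ws = [sample_p0(rng) for _ in range(K)]
    # Log weights -H(w_k) + H0(w_k); subtracting the maximum avoids overflow
    # and cancels between numerator and denominator.
    log_weights = [-H(w) + H0(w) for w in ws]
    shift = max(log_weights)
    weights = [math.exp(lw - shift) for lw in log_weights]
    numerator = sum(wt * f(w) for wt, w in zip(weights, ws))
    return numerator / sum(weights)
```

For example, with $H(w) = (w-1)^2/2$ and a wider normal $p_0$, calling the
function with `f=lambda w: w` should return a value close to 1, up to Monte
Carlo error.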


In the Markov chain Monte Carlo (MCMC) method, a sequence
$\{w_1, w_2, w_3, \ldots\}$ is generated iteratively by a conditional
probability $p(w_{k+1}|w_k)$. It is known that (1) and (2) below are
sufficient conditions for eq.(7.1) to hold.
(1) (Detailed Balance Condition). For arbitrary parameters $w_a, w_b \in W$,

$$p(w_b|w_a)p(w_a) = p(w_a|w_b)p(w_b).$$

(2) (Irreducible Condition). For an arbitrary $w \in W$, the probability
that a parameter of $\{w_k\}$ is contained in the neighborhood of $w$ is not
equal to zero.


Note that the detailed balance condition is not necessary for eq.(7.1);
however, several MCMC algorithms satisfy the detailed balance condition.
Firstly, we study the Metropolis method.


7.1.1 Basic Metropolis Method 
Let $r(w_1|w_2)$ be a conditional probability density which satisfies
$$\forall (w_1, w_2), \quad r(w_1|w_2) = r(w_2|w_1). \tag{7.4}$$

In the Metropolis method, the set of parameters
$\{w(t) \in \mathbb{R}^d ; t = 1, 2, 3, \ldots\}$ is generated as follows.


Metropolis Method.
(1) Initialize $w(1)$ and $t = 1$.
(2) A candidate $w'$ is generated by $r(w'|w(t))$.

(3) By using $\Delta H = H(w') - H(w(t))$, the probability
$P = \min\{1, \exp(-\Delta H)\}$ is determined. Then set $w(t+1) = w'$ with
probability $P$, or $w(t+1) = w(t)$ with probability $1 - P$.

(4) $t := t + 1$, and return to (2).
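
The four steps above can be sketched in code as follows. This is a minimal
illustration, not a reference implementation: the name `metropolis`, the
isotropic normal proposal with standard deviation `step` (one possible
symmetric choice of $r(w_1|w_2)$), and the returned acceptance rate are
assumptions of this sketch.

```python
import math
import random


def metropolis(H, w_init, step, num_steps, rng):
    """Generate w(1), w(2), ... for the equilibrium state exp(-H(w)).

    H maps a list of floats to a float. Returns (samples, acceptance_rate).
    """
    w = list(w_init)
    h = H(w)
    samples = []
    accepted = 0
    for _ in range(num_steps):
        # Step (2): symmetric proposal r(w'|w(t)), here a normal random walk.
        cand = [wi + rng.gauss(0.0, step) for wi in w]
        h_cand = H(cand)
        # Step (3): accept with probability P = min{1, exp(-ΔH)}.
        delta_h = h_cand - h
        if delta_h <= 0.0 or rng.random() < math.exp(-delta_h):
            w, h = cand, h_cand
            accepted += 1
        samples.append(list(w))
    return samples, accepted / num_steps
```

The acceptance rate returned here is the empirical counterpart of the
acceptance probability discussed in Remark 53 (5); it can be monitored while
tuning `step`.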
We can prove that this procedure satisfies the detailed balance condition. 


Theorem 20. Metropolis method satisfies the detailed balance condition. 
Proof. Let $p(w(t+1)|w(t))$ be the conditional probability which is used in
one step of the Metropolis method. To show the detailed balance condition,
it is sufficient to prove

$$p(w_a|w_b)\exp(-H(w_b)) = p(w_b|w_a)\exp(-H(w_a)) \tag{7.5}$$

for an arbitrary pair $(w_a, w_b)$. For a given $w(t)$, the simultaneous
probability that $w'$ is generated from $w(t)$ and that $w(t+1) = w'$ is

$$r(w'|w(t))\min\{1, \exp(-H(w') + H(w(t)))\}.$$

Therefore, for a given $w(t)$, the probability that the new candidate place is
chosen is given by marginalization over $w'$,

$$Q(w(t)) = \int r(w'|w(t))\min\{1, \exp(-H(w') + H(w(t)))\}\,dw'.$$

Hence the probability that $w(t+1) = w(t)$ is $1 - Q(w(t))$, resulting in

$$p(w_a|w_b) = r(w_a|w_b)\min\{1, \exp(-H(w_a) + H(w_b))\} + \delta(w_a - w_b)(1 - Q(w_b)).$$

By using this relation, $r(w_a|w_b) = r(w_b|w_a)$, and the property of the
delta function, it follows that

$$\begin{aligned}
p(w_a|w_b)\exp(-H(w_b)) &= r(w_a|w_b)\min\{\exp(-H(w_b)), \exp(-H(w_a))\} \\
&\quad + \delta(w_a - w_b)(1 - Q(w_b))\exp(-H(w_b)) \\
&= r(w_b|w_a)\min\{\exp(-H(w_b)), \exp(-H(w_a))\} \\
&\quad + \delta(w_b - w_a)(1 - Q(w_a))\exp(-H(w_a)) \\
&= p(w_b|w_a)\exp(-H(w_a)),
\end{aligned}$$

which completes the proof of the theorem. □


Remark 52. (Metropolis-Hastings method) The Metropolis method can be
generalized to the case $r(w_a|w_b) \ne r(w_b|w_a)$. For such a case, the
probability $P$ is replaced by

$$P = \min\Big\{1, \frac{r(w(t)|w')\exp(-H(w'))}{r(w'|w(t))\exp(-H(w(t)))}\Big\}.$$

Then the detailed balance condition is satisfied.
Remark 53. Theoretically speaking, the Metropolis method gives the set of
parameters which ensures eq.(7.1) if $K \to \infty$. However, in practical
applications, there are several issues.

(1) The parameters in the period which is affected by the initial point should
be removed from the obtained parameter set. Such a period is called
'burn-in'.

(2) The obtained parameters are not independent if MCMC is used. For an
effective approximation of the posterior distribution, the dependency
between parameters should be reduced. Hence $\{w(m \times t); t = 1, 2, \ldots\}$
for some $m$ is chosen. If $m$ is large, then the dependency of the obtained
parameters is made small, but the computational cost increases. In this
book, $m$ is called a 'sampling interval'.

(3) If a probability distribution $p(w)$ has several distant peaks, then the
probability of moving from one peak to another becomes very small, hence the
irreducibility of MCMC often fails in practice. This is called the problem
of a 'potential barrier'.

(4) Let $w_0$ be the parameter that minimizes $H(w)$. If the set
$\{w \in W; H(w) - H(w_0) < \epsilon\}$ is connected but not contained in some
local region, then the MCMC process sometimes fails because the probability
of moving from one place to a distant place is very small. This is called
the problem of an 'entropy barrier'.

(5) The probability that the candidate parameter $w'$ is chosen is called the
acceptance probability. If the variance of $r(w_1|w_2)$ is small, then the
acceptance probability becomes high, but the candidate parameter is chosen
in a narrow local region. If it is large, then the acceptance probability
becomes small, but the candidate parameter is chosen from a wide range.
In the Metropolis method, optimization of the acceptance probability by
controlling $r(w_1|w_2)$ is one of the most important processes for
constructing MCMC.

(6) Several criteria which judge whether the parameters can be understood
as taken from the equilibrium state have been proposed [29, 30, 33].


7.1.2 Hamiltonian Monte Carlo 


In the original Metropolis method, in order to ensure the acceptance
probability is not small, the dependency between $w(t)$ and $w(t+1)$ becomes
large. The following method was devised to improve this property.

Let $w \in \mathbb{R}^d$. A new variable $v \in \mathbb{R}^d$ is introduced and
the total Hamiltonian of $(w, v)$ is defined by

$$H(w, v) = \frac{1}{2}\|v\|^2 + H(w).$$

If $\{(w_k, v_k)\}$ is subject to the equilibrium state $\exp(-H(w, v))$ of the
total Hamiltonian, then $\{w_k\}$ is subject to the equilibrium state
$\exp(-H(w))$ of the Hamiltonian $H(w)$. Thus we make $\{(w_k, v_k)\}$ subject
to the equilibrium state of the total Hamiltonian.


Hamiltonian Monte Carlo.

(1) Initialize $w(1)$ and $t = 1$.

(2) The elements of $v \in \mathbb{R}^d$ are independently generated by the
standard normal distribution.

(3) The following differential equation with respect to the time parameter
$\tau$ is solved with the initial condition $(w(t), v)$ at $\tau = 0$. Here
$\tau$ is a variable which has no relation to the MCMC time $t$.

$$\frac{dw}{d\tau} = v, \qquad \frac{dv}{d\tau} = -\nabla H(w).$$

This is known as the Hamilton equation, which describes the dynamics
of the canonical coordinate $(w, v)$. Then $(w', v')$ ($w' = w(\tau)$,
$v' = v(\tau)$) is obtained for a given time $\tau$. It is permissible that
the numerical solution of the differential equation contains errors.
However, it should satisfy the invariance condition under time reversal and
the volume conservation condition of the phase space. It is known that the
leap frog method in Remark 54 satisfies both conditions.

(4) By defining $\Delta H = H(w', v') - H(w(t), v)$, set $w(t+1) = w'$ with
probability $P = \min\{1, \exp(-\Delta H)\}$, or $w(t+1) = w(t)$ with
probability $1 - P$.
(5) $t := t + 1$. Return to (2).

This method satisfies the detailed balance condition for $\exp(-H(w, v))$.
Note that the rigorous solution of the differential equation satisfies
$dH/d\tau = 0$, hence it is expected that the numerical solution gives
$\Delta H \approx 0$; thus the acceptance probability can be made higher and
$w'$ can be generated at a place distant from $w(t)$.


Remark 54. (Leap frog method) The differential equation

$$\frac{dw}{d\tau} = v, \qquad \frac{dv}{d\tau} = f(w)$$

is numerically solved by the iteration,

$$\begin{aligned}
v(n+1/2) &= v(n) + \frac{\epsilon}{2} f(w(n)), \\
w(n+1) &= w(n) + \epsilon\, v(n+1/2), \\
v(n+1) &= v(n+1/2) + \frac{\epsilon}{2} f(w(n+1)),
\end{aligned}$$

where $\epsilon > 0$ is a small constant. In Hamiltonian Monte Carlo,
$f(w) = -\nabla H(w)$.

If $\epsilon$ is made very small, then the differential equation is solved with
high accuracy but the computational costs also become high. In Hamiltonian
Monte Carlo, controlling the balance between $\epsilon$ and the acceptance
probability is necessary. Recently, an improved algorithm was proposed
which determines them automatically, the no-U-turn Hamiltonian Monte
Carlo [38].
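
The leap frog iteration and the Hamiltonian Monte Carlo steps can be sketched
together as follows. This is a minimal illustration under assumptions: the
names `leapfrog` and `hmc`, the fixed step size `eps`, and the fixed number of
leap frog steps are choices of this sketch, and the gradient of $H(w)$ is
supplied by the caller.

```python
import math
import random


def leapfrog(w, v, grad_H, eps, num_steps):
    """Leap frog integration of dw/dτ = v, dv/dτ = -∇H(w)."""
    w, v = list(w), list(v)
    g = grad_H(w)
    for _ in range(num_steps):
        v = [vi - 0.5 * eps * gi for vi, gi in zip(v, g)]  # half step in v
        w = [wi + eps * vi for wi, vi in zip(w, v)]        # full step in w
        g = grad_H(w)
        v = [vi - 0.5 * eps * gi for vi, gi in zip(v, g)]  # half step in v
    return w, v


def hmc(H, grad_H, w_init, eps, num_leapfrog, num_samples, rng):
    """Hamiltonian Monte Carlo for the equilibrium state exp(-H(w))."""
    w = list(w_init)
    samples = []
    for _ in range(num_samples):
        # Step (2): fresh velocity from the standard normal distribution.
        v = [rng.gauss(0.0, 1.0) for _ in w]
        # Step (3): approximate solution of the Hamilton equation.
        w_new, v_new = leapfrog(w, v, grad_H, eps, num_leapfrog)
        # Step (4): accept with probability min{1, exp(-ΔH)} on the total Hamiltonian.
        total_old = H(w) + 0.5 * sum(x * x for x in v)
        total_new = H(w_new) + 0.5 * sum(x * x for x in v_new)
        delta_h = total_new - total_old
        if delta_h <= 0.0 or rng.random() < math.exp(-delta_h):
            w = w_new
        samples.append(list(w))
    return samples
```

Running `leapfrog` forward and then again from the end point with the velocity
negated returns to the starting point up to rounding error, which is the time
reversal property required in step (3).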


Example 51. For a probability density

$$p(x, y) \propto \exp(-Nx^2y^2 - Mx^2 - My^2),$$

where $N = 50$, $M = 0.005$, random variables $\{(X_i, Y_i)\}$
($i = 1, 2, \ldots, 300$) are generated by the Metropolis method, Gibbs
sampling, and Hamiltonian Monte Carlo. (See Figure 7.1.) For the Gibbs
sampler, see Section 7.2. Note that, if $M = 0$, then
$\int \exp(-Nx^2y^2)\,dx\,dy = \infty$, hence $p(x, y)$ is not a probability
density. The origin is a singularity of $x^2y^2$. In a normal mixture or an
artificial neural network, such a singular posterior distribution on a
higher dimensional parameter space has to be handled. By an experiment in
which 10000 random variables are generated, the empirical averages are
compared,

$$\begin{aligned}
E_{MET}[X] &= 0.076, & E_{MET}[Y] &= 0.4528, \\
E_{GIB}[X] &= 0.0289, & E_{GIB}[Y] &= 0.017, \\
E_{HAM}[X] &= 0.091, & E_{HAM}[Y] &= 0.089,
\end{aligned}$$

where $E_{MET}$, $E_{GIB}$, and $E_{HAM}$ mean the empirical averages by the
Metropolis method, the Gibbs sampler, and Hamiltonian Monte Carlo. The true
averages of $X$ and $Y$ are equal to zero. In the Metropolis and Hamiltonian
methods, one sampling process consists of 100 trials and 100 dynamical
calculations, respectively. In this case, the Hamiltonian has a form for
which the Gibbs sampler can be employed. However, in general it cannot be
applied. In the cases when Gibbs sampling cannot be employed, Hamiltonian
Monte Carlo gives the more accurate MCMC expectations.
[Figure 7.1, four panels: Probability Density, Metropolis Monte Carlo,
Gibbs Sampler, Hamiltonian Monte Carlo.]

Figure 7.1: Comparison of a probability distribution, the Metropolis method,
the Gibbs sampler, and the Hamiltonian method for
$p(x, y) \propto \exp(-Nx^2y^2 - Mx^2 - My^2)$. In many statistical models,
the posterior distribution contains singularities, hence MCMC processes for
such cases are very important.
7.1.3 Parallel Tempering


If the posterior distribution does not concentrate in some local region, the
parallel tempering or replica Monte Carlo is sometimes employed. Let

$$nL_n(w) = -\sum_{i=1}^{n} \log p(X_i|w).$$

Note that this function does not contain the prior information, in other
words,
$$H(w) = nL_n(w) - \log \varphi(w).$$

The equilibrium state of the inverse temperature $\beta > 0$ is defined by

$$p(w|\beta) = \frac{1}{Z_n(\beta)}\exp(-\beta nL_n(w))\varphi(w).$$

Then the posterior distribution is equal to $p(w|1)$. Let the sequence of
inverse temperatures be

$$0 = \beta_1 < \beta_2 < \cdots < \beta_J = 1.$$

The target probability distribution of the parallel tempering is

$$\prod_{j=1}^{J} p(w_j|\beta_j), \tag{7.6}$$

which is a probability density function of $(w_1, w_2, \ldots, w_J)$. If
parameters are taken from this distribution, then the set $\{w_J(t)\}$ can be
used for the posterior distribution, because $p(w|\beta_J) = p(w|1)$. The
parallel tempering consists of two MCMC processes.


Parallel Tempering.

(1) One is the independent MCMC process for each $p(w_j|\beta_j)$.

$$\begin{array}{l}
w_{11} \to w_{12} \to w_{13} \to \cdots \\
w_{21} \to w_{22} \to w_{23} \to \cdots \\
\qquad \vdots \\
w_{J1} \to w_{J2} \to w_{J3} \to \cdots
\end{array}$$

In this process, an arbitrary MCMC method can be used.

(2) The other is the exchange process between $w_j$ and $w_{j+1}$ with some
interval in each MCMC process. The probability of the exchange is
given by

$$\min\{1, \exp\{(\beta_{j+1} - \beta_j)(nL_n(w_{j+1}) - nL_n(w_j))\}\}, \tag{7.7}$$

which satisfies the detailed balance condition for the probability
density of eq.(7.6). Note that the prior does not affect the exchange
probability.

Even if $p(w_J|\beta_J)$ has many peaks, $p(w_j|\beta_j)$ for small $\beta_j$
does not, hence the equilibrium state can be more easily realized by
exchanging parameters.
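
The two processes can be sketched as follows for a scalar parameter. This is an
illustration under assumptions, not a prescribed implementation: the names
`exchange_probability` and `parallel_tempering` are hypothetical, each chain
uses one random-walk Metropolis step per sweep, and one randomly chosen
neighbouring pair is proposed for exchange after every sweep.

```python
import math
import random


def exchange_probability(beta_j, beta_j1, nL_j, nL_j1):
    """Exchange probability of eq.(7.7) for the neighbours (w_j, w_{j+1})."""
    exponent = (beta_j1 - beta_j) * (nL_j1 - nL_j)
    return 1.0 if exponent >= 0.0 else math.exp(exponent)


def parallel_tempering(nL, log_prior, betas, w_init, step, num_sweeps, rng):
    """Return the state of the chain with the largest beta after every sweep.

    nL(w) plays the role of n L_n(w); log_prior(w) is log φ(w) up to a constant.
    """
    ws = [w_init for _ in betas]
    samples = []
    for _ in range(num_sweeps):
        # (1) One Metropolis step per chain on β_j n L_n(w) - log φ(w).
        for j, beta in enumerate(betas):
            cand = ws[j] + rng.gauss(0.0, step)
            h_old = beta * nL(ws[j]) - log_prior(ws[j])
            h_new = beta * nL(cand) - log_prior(cand)
            if h_new <= h_old or rng.random() < math.exp(h_old - h_new):
                ws[j] = cand
        # (2) Propose one exchange between a random neighbouring pair.
        j = rng.randrange(len(betas) - 1)
        p = exchange_probability(betas[j], betas[j + 1], nL(ws[j]), nL(ws[j + 1]))
        if rng.random() < p:
            ws[j], ws[j + 1] = ws[j + 1], ws[j]
        samples.append(ws[-1])
    return samples
```

When the chain at the smaller inverse temperature currently has the smaller
value of $nL_n$, the exponent in eq.(7.7) is positive and the exchange is
always accepted, which is how low-loss states found by the flatter chains are
passed toward $\beta = 1$.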


Theorem 21. Parallel tempering using the exchange probability of eq.(7.7)
satisfies the detailed balance condition.


Proof. Let us use $(u, v)$ for $(w_j, w_{j+1})$ and $(\alpha, \beta)$ for
$(\beta_j, \beta_{j+1})$. The target probability distribution is
$$P(u, v) \propto \exp(-\alpha nL_n(u))\varphi(u)\exp(-\beta nL_n(v))\varphi(v) = \exp(-f(u, v)),$$
where
$$f(u, v) = \alpha nL_n(u) + \beta nL_n(v) - \log\varphi(u) - \log\varphi(v).$$
By the exchange $(u, v) \to (v, u)$,
$$\begin{aligned}
\Delta f &= f(v, u) - f(u, v) \\
&= \alpha nL_n(v) + \beta nL_n(u) - \alpha nL_n(u) - \beta nL_n(v) \\
&= (\beta - \alpha)(nL_n(u) - nL_n(v)).
\end{aligned}$$
Therefore the exchange process whose probability is defined by
$$\min\{1, \exp(-\Delta f)\}$$
satisfies the detailed balance condition. □


Remark 55. Assume that the posterior distribution has the real log
canonical threshold $\lambda$. The exchange probability between $\beta_1$ and
$\beta_2$ ($\beta_2 > \beta_1$) is asymptotically (for $n \to \infty$) given by
the following formula [52].
$$P(\beta_1, \beta_2) = 1 - \frac{1}{\sqrt{\pi}}\,\frac{\beta_2 - \beta_1}{\beta_1}\,\frac{\Gamma(\lambda + 1/2)}{\Gamma(\lambda)}.$$

If the sequence $\{\beta_j\}$ is set as a geometric progression, the exchange
probability becomes a constant for a sufficiently large $n$.
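
As a small numerical illustration of Remark 55, the formula can be evaluated
directly. The function names below are hypothetical, and the geometric
sequence is taken over strictly positive inverse temperatures only (an
assumption of this sketch, since a geometric progression cannot contain
$\beta = 0$).

```python
import math


def asymptotic_exchange_probability(beta1, beta2, lam):
    """Asymptotic exchange probability of Remark 55 for beta2 > beta1 > 0."""
    ratio = (beta2 - beta1) / beta1
    return 1.0 - ratio * math.gamma(lam + 0.5) / (math.sqrt(math.pi) * math.gamma(lam))


def geometric_ladder(beta_min, num_levels):
    """Inverse temperatures beta_min < ... < 1 with a constant ratio."""
    r = (1.0 / beta_min) ** (1.0 / (num_levels - 1))
    return [beta_min * r ** j for j in range(num_levels)]
```

Because $(\beta_{j+1} - \beta_j)/\beta_j$ is the same for every neighbouring
pair of a geometric progression, the formula returns the same value for all
pairs, in agreement with the last sentence of the remark.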
7.2 Gibbs Sampler


The Metropolis method can be applied to any Hamiltonian function; however,
it is not easy to generate parameters globally. Hamiltonian Monte Carlo
improves on that difficulty. Although the Gibbs sampler can be used only
for special posterior distributions, if it can be employed, it is rather
easy to generate parameters globally.

In the Gibbs sampler, a parameter $w \in \mathbb{R}^d$ is divided as
$w = (w_1, w_2)$. Let the posterior distribution be $p(w_1, w_2)$. Then two
conditional probability distributions $p(w_1|w_2)$ and $p(w_2|w_1)$ are
defined from $p(w_1, w_2)$. The set of parameters
$\{w(t) = (w_1(t), w_2(t)) \in \mathbb{R}^d; t = 1, 2, 3, \ldots\}$ is generated
by the following procedure.

Gibbs Sampler.
(1) Initialize $w(1) = (w_1(1), w_2(1))$ and $t = 1$.
(2) One of (A) or (B) is chosen with probability 1/2.
(A) $w_2'$ is generated by $p(w_2'|w_1(t))$, then $w_1'$ is generated by $p(w_1'|w_2')$.
(B) $w_1'$ is generated by $p(w_1'|w_2(t))$, then $w_2'$ is generated by $p(w_2'|w_1')$.
(3) Set $w(t+1) = (w_1', w_2')$ and $t := t + 1$. Return to (2).
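
As a concrete sketch of this procedure, consider the density of Example 51,
$p(x, y) \propto \exp(-Nx^2y^2 - Mx^2 - My^2)$, with $w_1 = x$ and $w_2 = y$.
Given one coordinate, the other follows a normal distribution with mean 0 and
variance $1/(2(N \cdot \mathrm{other}^2 + M))$, so both conditional
distributions can be sampled directly. The function name and arguments are
assumptions of this illustration.

```python
import math
import random


def gibbs_singular_density(N, M, num_samples, rng, x=0.0, y=0.0):
    """Gibbs sampler for p(x, y) ∝ exp(-N x^2 y^2 - M x^2 - M y^2)."""

    def draw(other):
        # Conditional density ∝ exp(-(N other^2 + M) z^2): normal with
        # mean 0 and variance 1 / (2 (N other^2 + M)).
        return rng.gauss(0.0, math.sqrt(0.5 / (N * other * other + M)))

    samples = []
    for _ in range(num_samples):
        if rng.random() < 0.5:  # (A): y' from p(y'|x(t)), then x' from p(x'|y')
            y = draw(x)
            x = draw(y)
        else:                   # (B): x' from p(x'|y(t)), then y' from p(y'|x')
            x = draw(y)
            y = draw(x)
        samples.append((x, y))
    return samples
```

Each conditional draw can move a coordinate across the whole axis when the
other coordinate is near zero, which is the sense in which the Gibbs sampler
generates parameters globally for this density.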


Theorem 22. The Gibbs sampler satisfies the detailed balance condition.


Proof. The probability density of $(w_1', w_2')$ for a given set $(w_1, w_2)$
is given by
$$p(w_1', w_2'|w_1, w_2) = \frac{1}{2}\{p(w_2'|w_1')p(w_1'|w_2) + p(w_1'|w_2')p(w_2'|w_1)\}.$$

Let us prove the detailed balance condition,
$$p(w_1', w_2'|w_1, w_2)\,p(w_1, w_2) = p(w_1, w_2|w_1', w_2')\,p(w_1', w_2').$$
By using the definition of the conditional probability,

$$\begin{aligned}
p(w_2'|w_1')p(w_1'|w_2)p(w_1, w_2) &= \frac{p(w_1', w_2')}{p(w_1')}\,\frac{p(w_1', w_2)}{p(w_2)}\,p(w_1, w_2) \\
&= p(w_1', w_2')\,p(w_2|w_1')\,p(w_1|w_2).
\end{aligned}$$

By the same method,

$$p(w_1'|w_2')p(w_2'|w_1)p(w_1, w_2) = p(w_1', w_2')\,p(w_1|w_2')\,p(w_2|w_1).$$

By the sum of these equations, the theorem is proved. □
Remark 56. (1) In the above definition, the order of sampling $w_1$ and $w_2$
is chosen with the same probabilities 1/2. If the order is fixed, then the
detailed balance condition is not satisfied; however, the same equilibrium
state can be obtained.

(2) If one of two procedures $p(w_1'|w_2)$ and $p(w_2'|w_1)$ is chosen with the
same probability, then the conditional probability is given by

$$p(w_1', w_2'|w_1, w_2) = \frac{1}{2}\{p(w_1'|w_2)\delta(w_2' - w_2) + p(w_2'|w_1)\delta(w_1' - w_1)\},$$

which also satisfies the detailed balance condition.


7.2.1 Gibbs Sampler for Normal Mixture 


The Gibbs sampler is often employed in mixture models. Let us derive an
algorithm by the Gibbs sampler for a normal mixture. In this subsection, a
normal distribution of $x \in \mathbb{R}^M$ for a given $b \in \mathbb{R}^M$ is
denoted by

$$N(x|b) = \frac{1}{(2\pi)^{M/2}}\exp\Big(-\frac{\|x - b\|^2}{2}\Big).$$

Then a normal mixture is defined by

$$p(x|a, b) = \sum_{k=1}^{K} a_k N(x|b_k),$$

where $a = (a_1, a_2, \ldots, a_K)$ and $b = (b_1, b_2, \ldots, b_K)$ are the
parameters of a normal mixture, which satisfy $\sum_k a_k = 1$ and
$a_k \ge 0$, and $b_k \in \mathbb{R}^M$. For the prior, we adopt

$$\varphi(a) = \frac{1}{z_1}\prod_{k=1}^{K} (a_k)^{\alpha_k - 1},$$
$$\varphi(b) = \frac{1}{z_2}\prod_{k=1}^{K} \exp\Big(-\frac{1}{2\sigma^2}\|b_k\|^2\Big),$$

where $\varphi(a)$ and $\varphi(b)$ are the Dirichlet distribution with index
$\{\alpha_k\}$ and the normal distribution respectively. Here $\{\alpha_k\}$ and
$\sigma^2 > 0$ are hyperparameters and $z_1, z_2 > 0$ are constants. Let
$y = (y^{(1)}, y^{(2)}, \ldots, y^{(K)})$ be a competitive variable, in other
words, $y$ takes value on the following set,

$$C_K = \{(1, 0, \ldots, 0), (0, 1, \ldots, 0), \ldots, (0, 0, \ldots, 1)\}.$$
Then a statistical model for the simultaneous probability density of $(x, y)$
is defined by

$$p(x, y|a, b) = \prod_{k=1}^{K} \{a_k N(x|b_k)\}^{y^{(k)}}.$$

It follows that

$$p(x|a, b) = \sum_{y \in C_K} p(x, y|a, b).$$

Therefore a normal mixture $p(x|a, b)$ can be understood as a statistical
model $p(x, y|a, b)$ which has a latent or hidden variable $y \in C_K$.
Let $x^n = \{x_1, x_2, \ldots, x_n\}$ be an independent sample and
$y_i \in C_K$ be the competitive variable which corresponds to a sample point
$x_i$. We use a notation $y^n = \{y_1, y_2, \ldots, y_n\}$. The $k$th element of
$y_i$ is denoted by $y_i^{(k)}$. Then the Bayesian simultaneous probability is

$$p(a, b, x^n, y^n) = \varphi(a)\varphi(b)\prod_{i=1}^{n} p(x_i, y_i|a, b) \tag{7.8}$$

$$\qquad = \frac{1}{z_1 z_2}\prod_{k=1}^{K} (a_k)^{\alpha_k - 1}\exp\Big(-\frac{\|b_k\|^2}{2\sigma^2}\Big)\prod_{i=1}^{n}\prod_{k=1}^{K}\{a_k N(x_i|b_k)\}^{y_i^{(k)}} \tag{7.9}$$

$$\qquad = \frac{1}{Z}\Big[\prod_{k=1}^{K} (a_k)^{\alpha_k + n_k - 1}\Big]\Big[\prod_{k=1}^{K} \exp(-H_k(b_k))\Big], \tag{7.10}$$

where $Z$ is a normalizing constant and

$$n_k = \sum_{i=1}^{n} y_i^{(k)}, \qquad H_k(b_k) = \frac{1}{2\sigma^2}\|b_k\|^2 + \frac{1}{2}\sum_{i=1}^{n} y_i^{(k)}\|x_i - b_k\|^2.$$

In Bayesian estimation, we need the posterior parameters $\{(a, b)\}$ which
are subject to $p(a, b|x^n)$. If $\{(a, b, y^n)\}$ are subject to
$p(a, b, y^n|x^n)$, then $\{(a, b)\}$ of them can be used for numerical
approximation of the posterior distribution.

Therefore it is sufficient to make a Gibbs sampler for $(a, b)$ and $y^n$, in
which the conditional probabilities we need are

$$p(y^n|a, b, x^n), \qquad p(a, b|x^n, y^n).$$
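In fact, by using these conditional probabilities, we can make a Gibbs
sampler for $(a, b) \to y^n$ and $y^n \to (a, b)$. By eq.(7.9), under the
probability distribution $p(y^n|a, b, x^n)$, $y_1, y_2, \ldots, y_n$ are
independent and, for each $i$,

$$p(y_i|a, b, x_i) \propto \prod_{k=1}^{K} [a_k N(x_i|b_k)]^{y_i^{(k)}}.$$

In other words,

$$p(y_i^{(k)} = 1|a, b, x_i) = \frac{a_k N(x_i|b_k)}{p(x_i|a, b)}. \tag{7.11}$$

On the other hand, if $(a, b_1, b_2, \ldots, b_K)$ is subject to the
probability distribution $p(a, b|x^n, y^n)$, they are independent by
eq.(7.10). Hence

$$p(a, b|x^n, y^n) = p(a|x^n, y^n)\prod_{k=1}^{K} p(b_k|x^n, y^n).$$

The variable $a$ is subject to the Dirichlet distribution with index
$\alpha_k + n_k$,

$$p(a|x^n, y^n) \propto \prod_{k=1}^{K} (a_k)^{\alpha_k + n_k - 1}. \tag{7.12}$$

By using

$$\begin{aligned}
H_k(b_k) &= \frac{1}{2}\Big(\frac{1}{\sigma^2} + n_k\Big)\|b_k\|^2 - \Big(\sum_{i=1}^{n} y_i^{(k)} x_i\Big)\cdot b_k + \mathrm{Const.} \\
&= \frac{1}{2}\Big(\frac{1}{\sigma^2} + n_k\Big)\Big\|b_k - \Big(\sum_{i=1}^{n} y_i^{(k)} x_i\Big)\Big/\Big(\frac{1}{\sigma^2} + n_k\Big)\Big\|^2 + \mathrm{Const.},
\end{aligned}$$

the variable $b_k$ is subject to the normal distribution with average $b_k^*$
and variance $(\sigma_k^*)^2$,
$$p(b_k|x^n, y^n) = N(b_k^*, (\sigma_k^*)^2), \tag{7.13}$$

where

$$b_k^* = \Big(\sum_{i=1}^{n} y_i^{(k)} x_i\Big)\Big/\Big(\frac{1}{\sigma^2} + n_k\Big), \qquad (\sigma_k^*)^2 = 1\Big/\Big(\frac{1}{\sigma^2} + n_k\Big).$$

Hence we obtained the Gibbs sampler from eqs.(7.11), (7.12), and (7.13).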
Gibbs Sampler for Normal Mixture.
(1) A parameter set $\{(a_k, b_k); k = 1, 2, \ldots, K\}$ is initialized.
(2) A set of hidden variables $\{y_i = (y_i^{(k)})\}$ is determined by the
probability,
$$p(y_i^{(k)} = 1|a, b, x_i) = \frac{a_k N(x_i|b_k)}{p(x_i|a, b)}.$$
(3) Using $n_k = \sum_{i=1}^{n} y_i^{(k)}$, a parameter $a = \{a_k\}$ is
generated by using the Dirichlet distribution,

$$p(a|x^n, y^n) \propto \prod_{k=1}^{K} (a_k)^{\alpha_k + n_k - 1}.$$

(4) A parameter $\{b_k\}$ is generated by the normal distribution
$N(b_k^*, (\sigma_k^*)^2)$ where

$$b_k^* = \Big(\sum_{i=1}^{n} y_i^{(k)} x_i\Big)\Big/\Big(\frac{1}{\sigma^2} + n_k\Big), \qquad (\sigma_k^*)^2 = 1\Big/\Big(\frac{1}{\sigma^2} + n_k\Big).$$

(5) Return to (2).
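
A compact sketch of steps (1) to (5) is given below. It is an illustration
under assumptions rather than the implementation used for the experiments in
this book: the function names, the common Dirichlet index `alpha` for all
components, the optional `b_init` argument, and the use of normalized gamma
draws to sample the Dirichlet distribution are choices of this sketch.

```python
import math
import random


def log_normal(x, b):
    """log N(x|b) for the unit-variance normal density of this subsection."""
    sq = sum((xi - bi) ** 2 for xi, bi in zip(x, b))
    return -0.5 * len(x) * math.log(2.0 * math.pi) - 0.5 * sq


def sample_categorical(log_weights, rng):
    """Draw an index with probability proportional to exp(log_weights)."""
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]
    r = rng.random() * sum(weights)
    for k, wt in enumerate(weights):
        r -= wt
        if r < 0.0:
            return k
    return len(weights) - 1


def gibbs_normal_mixture(xs, K, alpha, sigma2, num_iters, rng, b_init=None):
    """Return a list of (a, b) draws from the Gibbs sampler."""
    M = len(xs[0])
    a = [1.0 / K] * K
    b = [list(v) for v in b_init] if b_init else [list(rng.choice(xs)) for _ in range(K)]
    draws = []
    for _ in range(num_iters):
        # Step (2): hidden variables from eq.(7.11), stored as component labels.
        labels = [
            sample_categorical([math.log(a[k]) + log_normal(x, b[k]) for k in range(K)], rng)
            for x in xs
        ]
        # Step (3): a from the Dirichlet distribution with index alpha + n_k.
        n = [labels.count(k) for k in range(K)]
        g = [rng.gammavariate(alpha + n[k], 1.0) for k in range(K)]
        a = [gk / sum(g) for gk in g]
        # Step (4): b_k from the normal distribution N(b_k^*, (sigma_k^*)^2).
        for k in range(K):
            prec = 1.0 / sigma2 + n[k]
            sums = [sum(x[m] for x, lab in zip(xs, labels) if lab == k) for m in range(M)]
            b[k] = [rng.gauss(sums[m] / prec, math.sqrt(1.0 / prec)) for m in range(M)]
        draws.append((list(a), [list(bk) for bk in b]))
    return draws
```

The hidden variables are stored as integer labels instead of vectors in
$C_K$; the label $k$ of a sample point corresponds to $y_i^{(k)} = 1$.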


Example 52. An experiment is conducted for the case $M = 2$, $K = 2$,
$\alpha_k = 1$, and $n = 100$. In Figure 7.2, for the four different true
distributions, the posterior parameter sample points of
$b_1 = (b_{11}, b_{12})$ and $b_2 = (b_{21}, b_{22})$ are displayed. The centers
of the true distributions $(0.2B, 0.2B)$ and $(-0.2B, -0.2B)$ for
$B = 4, 3, 2, 1$ are shown by the white circles. The true parameter of $a$ is
$a_0 = 0.5$. For $B \ge 3$, the posterior distributions are localized, whereas
for $B \le 2$, they are singular. Both the cross validation loss and WAIC can
be applied to all cases, because both criteria can be used without normality
of the posterior distribution.


7.2.2 Nonparametric Bayesian Sampler 


The Gibbs sampler for mixture models can be extended as a nonparametric 
Bayesian sampler. 

Firstly, we derive an MCMC method for a mixture model which is
mathematically equivalent to the Gibbs sampler. The marginal probability
density function of $(b, y^n)$ is given by

$$p(b, y^n|x^n) = \int p(a, b, y^n|x^n)\,da.$$

[Figure 7.2, four panels, each titled "Posterior. o : true parameter".]

Figure 7.2: Posterior distributions of normal mixtures ($n = 100$) are
displayed. The true distribution is a normal mixture which consists of two
normal distributions whose centers are indicated by the white circles. As
the distance between the two circles is made smaller, the posterior
distribution becomes singular.
If we obtain the MCMC sample from $p(b, y^n|x^n)$, then the posterior
distribution of $\{a\}$ or its average can be obtained by eq.(7.12). Therefore
it is sufficient to make a Gibbs sampler for $p(b, y^n|x^n)$, which requires
$p(b|x^n, y^n)$ and $p(y^n|b, x^n)$. The former is equal to the direct product
of eq.(7.13). To derive the latter, by using

$$\int \prod_{k=1}^{K} (a_k)^{n_k}\varphi(a)\,da = \frac{\Gamma(\sum_{k=1}^{K} \alpha_k)}{\Gamma(\sum_{k=1}^{K} (n_k + \alpha_k))}\prod_{k=1}^{K}\frac{\Gamma(n_k + \alpha_k)}{\Gamma(\alpha_k)}$$

and eq.(7.9),

$$p(b, y^n|x^n) \propto \prod_{k=1}^{K}\Big[\Gamma(n_k + \alpha_k)\exp\Big(-\frac{\|b_k\|^2}{2\sigma^2}\Big)\prod_{i=1}^{n} N(x_i|b_k)^{y_i^{(k)}}\Big].$$

Hence

$$p(y^n|b, x^n) \propto \prod_{k=1}^{K}\Big[\Gamma(n_k + \alpha_k)\prod_{i=1}^{n} N(x_i|b_k)^{y_i^{(k)}}\Big].$$

Let $N_k(i)$ be the sum of $y_j^{(k)}$ whose sample point number is not larger
than $i$. That is to say,

$$N_k(i) = \sum_{j=1}^{i} y_j^{(k)}.$$

Then $N_k(n) = n_k$ and

$$\Gamma(n_k + \alpha_k) = \Gamma(\alpha_k)\prod_{i=1}^{n} (\alpha_k + N_k(i) - 1)^{y_i^{(k)}}.$$

Since $\Gamma(\alpha_k)$ is a constant function of $y_i^{(k)}$,

$$p(y^n|b, x^n) \propto \prod_{i=1}^{n}\Big[\prod_{k=1}^{K}\{(\alpha_k + N_k(i) - 1)N(x_i|b_k)\}^{y_i^{(k)}}\Big].$$

By using

$$p(y_1) \propto \prod_{k=1}^{K}\{\alpha_k N(x_1|b_k)\}^{y_1^{(k)}}, \tag{7.14}$$

$$p(y_i|y^{i-1}) \propto \prod_{k=1}^{K}\{(\alpha_k + N_k(i) - 1)N(x_i|b_k)\}^{y_i^{(k)}}, \tag{7.15}$$

where $y^{i-1} = \{y_1, y_2, \ldots, y_{i-1}\}$, the random variable
$(y_1, y_2, \ldots, y_n)$ can be generated by iteration,

$$p(y^n|b, x^n) = p(y_1)\,p(y_2|y_1)\,p(y_3|y_1, y_2)\cdots p(y_n|y^{n-1}).$$

Note that eq.(7.15) means

$$p(y_1^{(k)} = 1) = \frac{\alpha_k N(x_1|b_k)}{\sum_{k'=1}^{K} \alpha_{k'} N(x_1|b_{k'})}, \tag{7.16}$$

$$p(y_i^{(k)} = 1|y^{i-1}) = \frac{(\alpha_k + N_k(i-1))N(x_i|b_k)}{\sum_{k'=1}^{K} (\alpha_{k'} + N_{k'}(i-1))N(x_i|b_{k'})}. \tag{7.17}$$

That is to say, $y_i$ is determined by $N_k(i-1)$, which is the cumulative sum
of $y_1, y_2, \ldots, y_{i-1}$. If $N_k(i-1)$ is large, then the probability
that $y_i^{(k)} = 1$ is also large. This stochastic procedure is called "the
Chinese restaurant process", where $i$ is a guest of a restaurant and $k$ is
the number of a table. The $i$th guest determines a table according to the
numbers of persons sitting at the tables.

If $K \to \infty$ and $\alpha_k = \alpha/K$, then this Gibbs sampler determined
by eqs.(7.13), (7.16), and (7.17) gives the statistical estimation of the
nonparametric Bayesian method. We obtained the following algorithm.


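Nonparametric Bayesian Sampler for Normal Mixture.
(1) A parameter set $\{(a_k, b_k); k = 1, 2, \ldots, K\}$ is initialized.
(2) A set of hidden variables $\{y_i = (y_i^{(k)})\}$ is iteratively determined
by the probability,
$$p(y_1^{(k)} = 1) = \frac{\alpha_k N(x_1|b_k)}{\sum_{k'=1}^{K} \alpha_{k'} N(x_1|b_{k'})},$$
$$p(y_i^{(k)} = 1|y^{i-1}) = \frac{(\alpha_k + N_k(i-1))N(x_i|b_k)}{\sum_{k'=1}^{K} (\alpha_{k'} + N_{k'}(i-1))N(x_i|b_{k'})},$$
where $N_k(i) = \sum_{j=1}^{i} y_j^{(k)}$.
(3) A parameter $\{b_k\}$ is generated by the normal distribution
$N(b_k^*, (\sigma_k^*)^2)$ where

$$b_k^* = \Big(\sum_{i=1}^{n} y_i^{(k)} x_i\Big)\Big/\Big(\frac{1}{\sigma^2} + n_k\Big), \qquad (\sigma_k^*)^2 = 1\Big/\Big(\frac{1}{\sigma^2} + n_k\Big).$$

(4) Return to (2).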
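
Step (2) of this algorithm, the sequential assignment by eqs.(7.16) and (7.17),
can be sketched as follows. The function names are hypothetical, the hidden
variables are stored as integer labels, and the update of $b_k$ in step (3) is
left out because it is identical to the one in the Gibbs sampler for a normal
mixture.

```python
import math
import random


def log_normal(x, b):
    """log N(x|b) for the unit-variance normal density of Section 7.2.1."""
    sq = sum((xi - bi) ** 2 for xi, bi in zip(x, b))
    return -0.5 * len(x) * math.log(2.0 * math.pi) - 0.5 * sq


def sample_categorical(log_weights, rng):
    """Draw an index with probability proportional to exp(log_weights)."""
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]
    r = rng.random() * sum(weights)
    for k, wt in enumerate(weights):
        r -= wt
        if r < 0.0:
            return k
    return len(weights) - 1


def sample_hidden_sequentially(xs, bs, alphas, rng):
    """Assign y_1, ..., y_n one by one; returns (labels, counts N_k(n))."""
    K = len(bs)
    counts = [0] * K  # N_k(i-1): number of earlier points assigned to k
    labels = []
    for x in xs:
        # Weight of component k: (alpha_k + N_k(i-1)) N(x_i|b_k).
        log_w = [math.log(alphas[k] + counts[k]) + log_normal(x, bs[k]) for k in range(K)]
        k = sample_categorical(log_w, rng)
        labels.append(k)
        counts[k] += 1
    return labels, counts
```

In the restaurant picture, `counts[k]` is the number of guests already seated
at table `k` when the next guest arrives, so well-occupied tables attract
further guests.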
Remark 57. (1) This algorithm is a Gibbs sampler for $(b, y^n)$, whereas the
previous one was applied to $(a, b, y^n)$. The expectation operation over $a$
is analytically performed.

(2) In statistics, a nonparametric estimation of a density function of $X$ for
a given $X^n$ is usually defined by

$$\hat{p}(x) = \frac{1}{n\alpha^N}\sum_{i=1}^{n} \rho\Big(\frac{x - X_i}{\alpha}\Big),$$
where $\rho(x)$ is some kernel function such as a normal distribution, and
$\alpha$ is an optimized controlling parameter. In this method, the estimated
density is a mixture of $n$ functions. The nonparametric Bayesian method is
formally defined by the mixture of an infinite number of functions; however,
it requires very small $\{\alpha_k = \alpha/K\}$, so that the number $K$
essentially used in MCMC is finite. The optimal hyperparameter $\alpha$ that
minimizes the generalization loss can be evaluated by the cross validation
and WAIC. In order to ensure the generalization loss is smaller, the
infinite components should be controlled close to zero, therefore the prior
effect should be made stronger. The generalization error by the mixture
model with the appropriate finite number of components is smaller than that
from the mixture of infinite components.


7.3 Numerical Approximation of Observables 


By using the Markov chain Monte Carlo method, we can numerically calcu- 
late Bayesian observables. 


7.3.1 Generalization and Cross Validation Losses


Let $\{w_k; k = 1, 2, \ldots, K\}$ be a set of posterior parameters. The
generalization loss is numerically approximated by

$$G_n \approx -\frac{1}{T}\sum_{t=1}^{T} \log\Big(\frac{1}{K}\sum_{k=1}^{K} p(X_t|w_k)\Big),$$

where $\{X_t\}$ is a set of random variables which are independent of the
sample used in the posterior distribution. In general $T$ should be very
large, $T \gg n$, in order to minimize the fluctuation. ISCV and WAIC can
also be approximated numerically.

$$\mathrm{ISCV} \approx \frac{1}{n}\sum_{i=1}^{n} \log\Big(\frac{1}{K}\sum_{k=1}^{K} \frac{1}{p(X_i|w_k)}\Big),$$

$$\mathrm{WAIC} \approx -\frac{1}{n}\sum_{i=1}^{n} \log\Big(\frac{1}{K}\sum_{k=1}^{K} p(X_i|w_k)\Big) + \frac{V_n}{n},$$

$$V_n \approx \sum_{i=1}^{n}\Big[\frac{1}{K}\sum_{k=1}^{K} (\log p(X_i|w_k))^2 - \Big(\frac{1}{K}\sum_{k=1}^{K} \log p(X_i|w_k)\Big)^2\Big],$$

where, in the calculation of $V_n$, $(K/(K-1))V_n$ is more appropriate than
$V_n$, because it estimates the sum of the variances of $\log p(X_i|w_k)$ over
the posterior distribution.


Remark 58. In order to calculate the above observables, we need
$$p(X_i|w_k), \qquad i = 1, 2, \ldots, n, \quad k = 1, 2, \ldots, K.$$

Several software packages have parallel computation architectures. In such
a case, for every parameter $w_k$, the set

$$p(X_1|w_k), p(X_2|w_k), \ldots, p(X_n|w_k)$$
can be simultaneously calculated. Also for every $X_i$, the set
$$p(X_i|w_1), p(X_i|w_2), \ldots, p(X_i|w_K)$$

can be simultaneously calculated. Once $\{p(X_i|w_k)\}$ is obtained, the
above computation is not so heavy in general. For neural networks and
normal mixtures, this method is recommended for reducing the computational
costs.
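
Once the table of $\log p(X_i|w_k)$ is available, WAIC and ISCV follow from a
few sums. The sketch below assumes that the table is given as a list of $n$
rows with $K$ entries each and that $K \ge 2$ when the $K/(K-1)$ correction is
requested; the function names are hypothetical. Logarithms of averages are
computed with a shift by the maximum for numerical stability.

```python
import math


def log_mean_exp(values):
    """log of the average of exp(v), computed stably."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))


def waic_and_iscv(log_lik, unbiased_variance=True):
    """log_lik[i][k] = log p(X_i|w_k). Returns (WAIC, ISCV)."""
    n = len(log_lik)
    K = len(log_lik[0])
    training_loss = 0.0
    functional_variance = 0.0
    iscv = 0.0
    for row in log_lik:
        # -log of the posterior average of p(X_i|w)
        training_loss -= log_mean_exp(row) / n
        # posterior variance of log p(X_i|w)
        mean = sum(row) / K
        var = sum((v - mean) ** 2 for v in row) / K
        if unbiased_variance:
            var *= K / (K - 1)
        functional_variance += var
        # log of the posterior average of 1 / p(X_i|w)
        iscv += log_mean_exp([-v for v in row]) / n
    waic = training_loss + functional_variance / n
    return waic, iscv
```

If all posterior parameters give the same likelihood values, the variance term
vanishes and both quantities reduce to the training loss.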


7.3.2 Numerical Free Energy 


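Even if the posterior parameters $\{w_k\}$ are obtained, it is not enough to
numerically estimate the free energy or the minus log marginal likelihood.
Here we study a method to calculate them.


Definition 24. Let $\beta > 0$. The average of an arbitrary function $f(w)$
over the generalized posterior distribution is defined by

$$\mathbb{E}_w^{(\beta)}[f(w)] = \frac{\int f(w)\varphi(w)\prod_{i=1}^{n} p(X_i|w)^{\beta}\,dw}{\int \varphi(w)\prod_{i=1}^{n} p(X_i|w)^{\beta}\,dw}.$$

Then the case $\beta = 1$ is equal to the posterior average,
$\mathbb{E}_w^{(1)}[\ ] = \mathbb{E}_w[\ ]$.


Theorem 23. By using the minus log likelihood function,
$$L_n(w) = -\frac{1}{n}\sum_{i=1}^{n} \log p(X_i|w),$$

the free energy or the minus log marginal likelihood is given by

$$F_n = \int_0^1 \mathbb{E}_w^{(\beta)}[nL_n(w)]\,d\beta.$$


Proof. Let us define a function $F_n(\beta)$ by

$$F_n(\beta) = -\log \int \varphi(w)\prod_{i=1}^{n} p(X_i|w)^{\beta}\,dw.$$

Then $F_n(0) = 0$ and $F_n(1) = F_n$. By using

$$\frac{\partial}{\partial\beta}\Big(\prod_{i=1}^{n} p(X_i|w)\Big)^{\beta} = \log\Big(\prod_{i=1}^{n} p(X_i|w)\Big)\Big(\prod_{i=1}^{n} p(X_i|w)\Big)^{\beta},$$

it follows that

$$F_n(1) = \int_0^1 \frac{\partial F_n}{\partial\beta}(\beta)\,d\beta = \int_0^1 \mathbb{E}_w^{(\beta)}[nL_n(w)]\,d\beta,$$

which shows the theorem. □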


A calculation method of the free energy is derived by the same method 
as the above theorem. 


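Let $\{\beta_j; j = 0, 1, \ldots, J\}$ be a sequence,

$$0 = \beta_0 < \beta_1 < \cdots < \beta_J = 1.$$

Then, for $Z_n(\beta) = \exp(-F_n(\beta))$, which satisfies $Z_n(0) = 1$,

$$Z_n(1) = \prod_{j=0}^{J-1} \frac{Z_n(\beta_{j+1})}{Z_n(\beta_j)} = \prod_{j=0}^{J-1} \mathbb{E}_w^{(\beta_j)}\big[\exp(-(\beta_{j+1} - \beta_j)nL_n(w))\big]. \tag{7.18}$$

The free energy is given by
$$F_n(1) = -\log Z_n(1) = -\sum_{j=0}^{J-1} \log \mathbb{E}_w^{(\beta_j)}\big[\exp(-(\beta_{j+1} - \beta_j)nL_n(w))\big]. \tag{7.19}$$

Here we need the posterior distributions for $\beta_1, \ldots, \beta_{J-1}$,
$$\mathbb{E}_w^{(\beta_1)}[\ ], \quad \mathbb{E}_w^{(\beta_2)}[\ ], \quad \ldots, \quad \mathbb{E}_w^{(\beta_{J-1})}[\ ],$$

which can be obtained by the parallel tempering.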
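
Eq.(7.19) can be sketched numerically as follows. The estimator itself only
needs the values of $nL_n(w)$ at the parameters drawn for each $\beta_j$. To
keep the example self-contained, the check uses a toy model that is an
assumption of this sketch and not taken from the text: $p(x|w) = N(x|w, 1)$
with prior $N(w|0, 1)$, for which every tempered posterior is normal and the
exact free energy is known in closed form. The function names are
hypothetical.

```python
import math
import random


def log_mean_exp(values):
    """log of the average of exp(v), computed stably."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))


def free_energy_estimate(betas, nL_samples):
    """Eq.(7.19). nL_samples[j] holds n L_n(w) for draws w at inverse temperature betas[j]."""
    total = 0.0
    for j in range(len(betas) - 1):
        delta = betas[j + 1] - betas[j]
        total -= log_mean_exp([-delta * v for v in nL_samples[j]])
    return total


def gaussian_toy_check(xs, J, K, rng):
    """Return (estimate, exact free energy) for the toy normal model."""
    n, s, s2 = len(xs), sum(xs), sum(x * x for x in xs)

    def nL(w):
        return 0.5 * n * math.log(2.0 * math.pi) + 0.5 * sum((x - w) ** 2 for x in xs)

    betas = [j / J for j in range(J + 1)]
    nL_samples = []
    for beta in betas[:-1]:
        # Tempered posterior: normal with precision 1 + beta n, mean beta s / (1 + beta n).
        prec = 1.0 + beta * n
        nL_samples.append(
            [nL(rng.gauss(beta * s / prec, math.sqrt(1.0 / prec))) for _ in range(K)]
        )
    estimate = free_energy_estimate(betas, nL_samples)
    exact = 0.5 * n * math.log(2.0 * math.pi) + 0.5 * math.log(1.0 + n) + 0.5 * (s2 - s * s / (1.0 + n))
    return estimate, exact
```

In a real application the exact draws in `gaussian_toy_check` would be replaced
by the parameters produced by the parallel tempering chains.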


Remark 59. This method sometimes involves heavy computational costs. If
the posterior distribution can be approximated by some normal distribution,
then eq.(4.57) can be applied; otherwise WBIC can be employed, although
the difference between the free energy and WBIC is of order $\log\log n$ or
constant order. When the free energy is minimized with respect to a
hyperparameter, the derivative of $F_n$ with respect to the hyperparameter
can be calculated at a smaller computational cost.


Remark 60. By using a probability density function,

$$p(w) = \frac{1}{Z_n}\varphi(w)\exp(-nL_n(w)),$$

the expectation $\mathbb{E}_w[\ ]$ is defined by $p(w)$. Then

$$Z_n = \frac{1}{\mathbb{E}_w[\exp(nL_n(w))]}.$$

However, this method is not appropriate for calculating $Z_n$ because
$\exp(nL_n(w))$ takes large values where $p(w)$ is small.
[Figure 7.3, two panels, each labeled "o: Start, *: end".]

Figure 7.3: Examples of trajectories by the Hamiltonian equation. The
trajectories depend on initial conditions. The labels 'o' and '*' show the
start and end points respectively.


7.4 Problems 
1. Let us study a probability density on $W = [-2, 2] \times [-1, 1]$ defined by
$$p(u, v) \propto \exp(-nv^2(u + 1)^2(u - 1)^4).$$
Then the set of all points which attain the maximum of $p(u, v)$ is
$$W_0 = \{(u, v) \in W ; v(u + 1)(u - 1) = 0\}.$$

Prove that, when $n \to \infty$, this distribution converges to
$\delta(u - 1)\delta(v)$. Explain why it does not converge to all
neighborhoods of $W_0$.


2. For the same Hamiltonian function used in Example 51, the trajectories
of the Hamiltonian equation are shown in Figure 7.3. Explain the reason why
Hamiltonian Monte Carlo can cover the entire parameter set.


3. In Example 52, the posterior distributions of the parameter $a$ are not
displayed. For each case, explain the shape of the posterior distribution of
$a$. Also answer whether it is localized or not.
Chapter 8


Information Criteria 


In the foregoing chapters, we derived the theoretical behaviors of Bayesian
observables for a given set of a true distribution, a statistical model, and
a prior, $(q(x), p(x|w), \varphi(w))$. In the real world, we do not know the
true distribution $q(x)$, hence we need methods to estimate observables
without any information about $q(x)$. Information criteria are made to
overcome such problems. In this chapter we explain several information
criteria from two viewpoints, model selection and hyperparameter
optimization. In each viewpoint, the properties of the generalization loss
and the free energy or the minus log marginal likelihood are investigated.
This chapter consists of the following contents.


• Model Selection

    - Generalization Loss: CV, AIC, TIC, DIC, WAIC
    - Free Energy: F, BIC, WBIC

• Hyperparameter Optimization

    - Generalization Loss: CV, WAIC
    - Free Energy: F, DF


8.1 Model Selection 


In this section we study the model selection problem, which arises when we
have several candidate models and need to select one of them. There are two
approaches to model selection: minimizing the generalization loss and
minimizing the free energy. Minimizing the generalization loss is equivalent
to minimizing the Kullback-Leibler distance from the true density to the
predictive one, whereas minimizing the free energy is equivalent to maximizing
the posterior probability of a pair of a statistical model and a prior for a
given set of data.


8.1.1 Criteria for Generalization Loss 


Let us introduce the definitions of several information criteria which are used 
for estimation of the generalization loss. Since the generalization losses for 
Bayesian, maximum likelihood, maximum a posteriori, and posterior mean 
methods are different, we have to understand which generalization loss an 
information criterion estimates. 


Remark 61. In this book, the information criteria are defined as estimators
of the generalization loss

$$ -E_X[\log \hat{p}(X)], \tag{8.1} $$

where $\hat{p}(x)$ is the probability density of $x$ estimated by a statistical
estimation method. The original Akaike information criterion AIC was
defined to estimate

$$ -2n \times E_X[\log \hat{p}(X)], \tag{8.2} $$

and as a result many information criteria have been normalized so that they
estimate a loss on the same scale as AIC. If one needs information criteria
on the same scale as AIC, the values shown in this book should be multiplied
by $2n$. If eq.(8.1) is used, then the difference between candidate models is
measured on the scale of the Kullback-Leibler distance, whereas,
if eq.(8.2) is used, then it is measured on the scale of the number
of parameters.


Definition 25. The leave-one-out cross validation criterion CV and the
importance sampling cross validation criterion ISCV are respectively defined
by

$$ \mathrm{CV} = -\frac{1}{n}\sum_{i=1}^{n} \log E_w^{(-i)}[p(X_i|w)], \tag{8.3} $$

$$ \mathrm{ISCV} = \frac{1}{n}\sum_{i=1}^{n} \log E_w[1/p(X_i|w)], \tag{8.4} $$

where $E_w[\ ]$ and $E_w^{(-i)}[\ ]$ denote the ordinary posterior average and the
posterior average leaving $X_i$ out, respectively. Both criteria estimate the
Bayesian generalization loss if $\{X_i\}$ are independent. If the posterior
distributions in the above definitions are exactly realized and the averages
are finite, then CV = ISCV. However, if they are numerically approximated, for
example by Markov chain Monte Carlo, then $\mathrm{CV} \neq \mathrm{ISCV}$. In order to calculate
CV, the posterior distributions using $X^n \setminus X_i$ for all $i = 1, 2, \ldots, n$ are
necessary, whereas ISCV can be calculated from the single posterior distribution
using $X^n$.
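The computation of ISCV from a single set of posterior draws can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names `log_mean_exp` and `iscv`, the matrix name `loglik`, and the toy model (a normal distribution with unknown mean, unit variance, and a flat prior, so that the posterior is N(mean(x), 1/n) and can be sampled directly) are all assumptions made for the example. The toy model is chosen because its exact leave-one-out CV has a closed form to compare against.

```python
import numpy as np

def log_mean_exp(a, axis=0):
    """Numerically stable log(mean(exp(a))) along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(a - m), axis=axis))

def iscv(loglik):
    """ISCV = (1/n) sum_i log E_w[1 / p(X_i|w)], eq. (8.4).

    loglik[s, i] = log p(X_i | w_s), where w_1..w_S are draws from the
    single posterior distribution that uses the whole sample X^n.
    """
    return float(np.mean(log_mean_exp(-loglik, axis=0)))

# Toy check (assumed model): p(x|w) = N(x | w, 1) with a flat prior,
# so the posterior is N(mean(x), 1/n) and can be sampled directly.
rng = np.random.default_rng(0)
n, S = 50, 20000
x = rng.normal(0.0, 1.0, size=n)
w = rng.normal(x.mean(), 1.0 / np.sqrt(n), size=S)
loglik = -0.5 * np.log(2 * np.pi) - 0.5 * (x[None, :] - w[:, None]) ** 2
iscv_value = iscv(loglik)

# Exact leave-one-out CV for the same model: the predictive density of X_i
# given the other n-1 points is N(mean of the others, 1 + 1/(n-1)).
mean_loo = (x.sum() - x) / (n - 1)
var_loo = 1.0 + 1.0 / (n - 1)
cv_exact = float(np.mean(0.5 * np.log(2 * np.pi * var_loo)
                         + 0.5 * (x - mean_loo) ** 2 / var_loo))
```

In this setting `iscv_value` and `cv_exact` differ only by Monte Carlo error, which illustrates the statement CV = ISCV for an exactly realized posterior.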


Definition 26. The Akaike information criterion (AIC) and its Bayesian version
($\mathrm{AIC}_b$) are defined for the maximum likelihood and Bayesian methods,
respectively, by

$$ \mathrm{AIC} = -\frac{1}{n}\sum_{i=1}^{n} \log p(X_i|\hat{w}) + \frac{d}{n}, \tag{8.5} $$

$$ \mathrm{AIC}_b = -\frac{1}{n}\sum_{i=1}^{n} \log E_w[p(X_i|w)] + \frac{d}{n}, \tag{8.6} $$

where $\hat{w}$ is the maximum likelihood estimator. The Takeuchi information
criterion (TIC) and its Bayesian version ($\mathrm{TIC}_b$) are defined for the maximum
likelihood and Bayesian methods, respectively, by

$$ \mathrm{TIC} = -\frac{1}{n}\sum_{i=1}^{n} \log p(X_i|\hat{w}) + \frac{1}{n}\mathrm{tr}\big(I(\hat{w})J(\hat{w})^{-1}\big), \tag{8.7} $$

$$ \mathrm{TIC}_b = -\frac{1}{n}\sum_{i=1}^{n} \log E_w[p(X_i|w)] + \frac{1}{n}\mathrm{tr}\big(I(\bar{w})J(\bar{w})^{-1}\big), \tag{8.8} $$

where $\bar{w} = E_w[w]$ and

$$ I(w) = \frac{1}{n}\sum_{i=1}^{n} \nabla \log p(X_i|w)\,(\nabla \log p(X_i|w))^T, \tag{8.9} $$

$$ J(w) = -\frac{1}{n}\sum_{i=1}^{n} \nabla^2 \log p(X_i|w). \tag{8.10} $$

Note that the original AIC and TIC are criteria for the generalization loss of
the maximum likelihood method, whereas $\mathrm{AIC}_b$ and $\mathrm{TIC}_b$ are their
modifications for Bayesian estimation. In general, the generalization loss of Bayes
is different from that of the maximum likelihood method, and $\mathrm{AIC} \neq \mathrm{AIC}_b$
and $\mathrm{TIC} \neq \mathrm{TIC}_b$.

Definition 27. The deviance information criterion (DIC) is defined by

$$ \mathrm{DIC} = \frac{1}{n}\sum_{i=1}^{n} \log p(X_i|\bar{w}) - \frac{2}{n}\sum_{i=1}^{n} E_w[\log p(X_i|w)], \tag{8.11} $$

where $\bar{w} = E_w[w]$. It seems that DIC is made for estimating the
generalization loss of $p(x|\bar{w})$ rather than that of the predictive distribution
$E_w[p(x|w)]$. However, it might be employed for the Bayesian or the maximum a
posteriori method. The widely applicable information criterion (WAIC) is defined by

$$ \mathrm{WAIC} = -\frac{1}{n}\sum_{i=1}^{n} \log E_w[p(X_i|w)] + \frac{1}{n}\sum_{i=1}^{n} V_w[\log p(X_i|w)], \tag{8.12} $$

which estimates the generalization loss of the predictive density.
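Given a matrix of pointwise log-likelihoods evaluated at posterior draws, the Bayesian criteria above reduce to a few array operations. The following is a minimal sketch under stated assumptions: the function name `criteria`, its arguments, and the toy model (normal with unknown mean, unit variance, flat prior, hence posterior N(mean(x), 1/n) and d = 1) are illustrative choices, not part of the text. All values are on the per-sample scale of eq. (8.1).

```python
import numpy as np

def log_mean_exp(a, axis=0):
    """Numerically stable log(mean(exp(a))) along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(a - m), axis=axis))

def criteria(loglik, loglik_at_mean, d):
    """Return (AIC_b, DIC, WAIC) from posterior draws.

    loglik[s, i]      = log p(X_i | w_s) for posterior draws w_s
    loglik_at_mean[i] = log p(X_i | w_bar), with w_bar the posterior mean
    d                 = dimension of the parameter
    """
    n = loglik.shape[1]
    train_loss = -np.mean(log_mean_exp(loglik, axis=0))       # training loss T_n
    aic_b = train_loss + d / n                                 # eq. (8.6)
    dic = np.mean(loglik_at_mean) - 2.0 * np.mean(loglik)      # eq. (8.11)
    waic = train_loss + np.mean(np.var(loglik, axis=0))        # eq. (8.12)
    return float(aic_b), float(dic), float(waic)

def log_p(x, w):
    """log N(x | w, 1), the assumed toy model."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - w) ** 2

# Toy data and exact posterior draws (flat prior, so posterior = N(mean(x), 1/n)).
rng = np.random.default_rng(0)
n, S = 50, 20000
x = rng.normal(0.0, 1.0, size=n)
w = rng.normal(x.mean(), 1.0 / np.sqrt(n), size=S)

loglik = log_p(x[None, :], w[:, None])
aic_b, dic, waic = criteria(loglik, log_p(x, w.mean()), d=1)
```

For this regular and realizable toy model the three values are close to each other, in line with case (A) discussed next; the differences become informative only in unrealizable or nonregular settings.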


The behaviors of information criteria depend on the relation among a true
distribution $q(x)$, a statistical model $p(x|w)$, and a prior $\varphi(w)$. Let us
consider (A) a regular and realizable case, (B) a regular and unrealizable case,
and (C) a nonregular case.


(A) Regular and Realizable Case


If a true distribution is realizable by and regular for a statistical model,
and if the posterior distribution can be approximated by some normal
distribution, the generalization and training losses by Bayes, maximum a
posteriori, posterior mean, and the maximum likelihood methods have the same
asymptotic expansions,

$$ E[G_n] = L(w_0) + \frac{d}{2n} + o(1/n), $$

$$ E[T_n] = L(w_0) - \frac{d}{2n} + o(1/n), $$

where $d$ is the dimension of the parameter. In this case every criterion
among CV, ISCV, AIC, $\mathrm{AIC}_b$, TIC, $\mathrm{TIC}_b$, DIC, and WAIC satisfies

$$ E[\mathrm{Criterion}] = L(w_0) + \frac{d}{2n} + o(1/n). \tag{8.13} $$

Hence any of these information criteria can be employed. The asymptotic
standard deviations of all criteria are also equal to each other.


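(B) Regular and Unrealizable Case

If a true distribution is regular for but unrealizable by a statistical model,
we define

$$ \nu = \frac{1}{2}\,\mathrm{tr}(I_0 J_0^{-1}), $$

where

$$ I_0 = \int \nabla \log p(x|w_0)\,(\nabla \log p(x|w_0))^T\, q(x)\,dx, $$

$$ J_0 = -\int \nabla^2 \log p(x|w_0)\, q(x)\,dx. $$

If a true distribution is realizable by a statistical model, then $\nu = d/2$. If
the posterior distribution can be approximated by some normal distribution,
then the average generalization and training losses of Bayes and the
maximum likelihood methods are

$$ E[G_n] = L(w_0) + \frac{d}{2n} + o(1/n), \tag{8.14} $$

$$ E[T_n] = L(w_0) + \frac{d - 4\nu}{2n} + o(1/n), \tag{8.15} $$

$$ E[G_n(\mathrm{ML})] = L(w_0) + \frac{\nu}{n} + o(1/n), \tag{8.16} $$

$$ E[T_n(\mathrm{ML})] = L(w_0) - \frac{\nu}{n} + o(1/n), \tag{8.17} $$

where $G_n(\mathrm{ML})$ and $T_n(\mathrm{ML})$ are the generalization and training losses of the
maximum likelihood method, respectively. The generalization and training
losses of the maximum a posteriori and posterior mean methods are
asymptotically equal to those of the maximum likelihood method.

The averages of the cross validations are asymptotically equal to the
generalization loss,

$$ E[\mathrm{CV}] = L(w_0) + \frac{d}{2n} + o(1/n), \tag{8.18} $$

$$ E[\mathrm{ISCV}] = L(w_0) + \frac{d}{2n} + o(1/n). \tag{8.19} $$

Since $E[T_n] = L(w_0) + (d - 4\nu)/(2n) + o(1/n)$ and
$E[T_n(\mathrm{ML})] = L(w_0) - \nu/n + o(1/n)$, the definitions of AIC and $\mathrm{AIC}_b$ give

$$ E[\mathrm{AIC}] = L(w_0) + \frac{d - \nu}{n} + o(1/n), \tag{8.20} $$

$$ E[\mathrm{AIC}_b] = L(w_0) + \frac{3d - 4\nu}{2n} + o(1/n). \tag{8.21} $$

By the definitions of TIC and $\mathrm{TIC}_b$,

$$ E[\mathrm{TIC}] = L(w_0) + \frac{\nu}{n} + o(1/n), \tag{8.22} $$

$$ E[\mathrm{TIC}_b] = L(w_0) + \frac{d}{2n} + o(1/n). \tag{8.23} $$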


By using eq.(4.55) and eq.(4.56),

$$ E[\mathrm{DIC}] = L(w_0) + \frac{d - \nu}{n} + o(1/n), \tag{8.24} $$

$$ E[\mathrm{WAIC}] = L(w_0) + \frac{d}{2n} + o(1/n). \tag{8.25} $$

Therefore, CV, ISCV, $\mathrm{TIC}_b$, and WAIC can be used for estimating the
Bayesian generalization loss. To estimate the generalization loss of the
maximum likelihood method, TIC is available.


(C) Nonregular Case


If a true distribution is not regular for a statistical model, let $\lambda$ and $\nu$
be the real log canonical threshold and the singular fluctuation defined by

$$ \lambda = \text{Real Log Canonical Threshold}, \tag{8.26} $$

$$ \nu = \frac{1}{2} E_\xi[\mathrm{Fluc}(\xi)]. \tag{8.27} $$

Also let $\mu$ be the constant defined by Theorem 19. Then

$$ E[G_n] = L(w_0) + \frac{\lambda}{n} + o(1/n), \tag{8.28} $$

$$ E[T_n] = L(w_0) + \frac{\lambda - 2\nu}{n} + o(1/n), \tag{8.29} $$

$$ E[G_n(\mathrm{ML})] = L(w_0) + \frac{\mu}{n} + o(1/n), \tag{8.30} $$

$$ E[T_n(\mathrm{ML})] = L(w_0) - \frac{\mu}{n} + o(1/n). \tag{8.31} $$

In general, $\mu \gg \lambda$, hence Bayesian estimation attains a smaller
generalization loss than the maximum likelihood, maximum a posteriori, and
posterior mean methods. In this case, the average cross validation loss is
asymptotically equal to the generalization loss,

$$ E[\mathrm{CV}] = L(w_0) + \frac{\lambda}{n} + o(1/n), \tag{8.32} $$

$$ E[\mathrm{ISCV}] = L(w_0) + \frac{\lambda}{n} + o(1/n). \tag{8.33} $$

By the definitions of AIC and $\mathrm{AIC}_b$,

$$ E[\mathrm{AIC}] = L(w_0) + \frac{d - \mu}{n} + o(1/n), \tag{8.34} $$

$$ E[\mathrm{AIC}_b] = L(w_0) + \frac{\lambda - 2\nu + d}{n} + o(1/n). \tag{8.35} $$

TIC and $\mathrm{TIC}_b$ are undefined because $J(w)$ is not invertible, hence

$$ E[\mathrm{TIC}] = \text{Undefined}, \tag{8.36} $$

$$ E[\mathrm{TIC}_b] = \text{Undefined}. \tag{8.37} $$

The posterior average parameter $E_w[w]$ is not in the neighborhood of the
optimal parameter set, hence there exists $C > 0$ such that

$$ E[\mathrm{DIC}] = L(w_0) + C + o(1), \tag{8.38} $$

$$ E[\mathrm{WAIC}] = L(w_0) + \frac{\lambda}{n} + o(1/n). \tag{8.39} $$

In this case, the Bayes generalization loss is estimated by the cross
validation and WAIC. Note that in nonregular cases, no information criterion
is yet known that can estimate the generalization loss of the maximum
likelihood, maximum a posteriori, and posterior mean methods, because the
posterior distribution is far from any normal distribution. It seems that the
constant $\mu$ cannot be estimated because it depends on the optimal parameter
$w_0$.


Remark 62. The above results hold under the assumption that a sample $X^n$
consists of independent sample points. In the case when $\{Y_i\}$ are conditionally
independent given $x^n$, an information criterion can estimate the
generalization loss if it can do so in the independent case, whereas the cross
validation criterion cannot. For example, the cross validation loss cannot be
employed in a linear prediction of time series, whereas information criteria
can be.


Remark 63. From the mathematical point of view, the information criteria
need the asymptotic condition $n \to \infty$. In fact, AIC, $\mathrm{AIC}_b$, TIC, $\mathrm{TIC}_b$, and DIC
require asymptotic normality, so the sample size $n$ should be
large enough. However, WAIC does not require asymptotic normality,
hence it can estimate the generalization loss even if $n$ is not so large.
Experimentally, in many singular statistical models, WAIC can estimate the
generalization loss even if $n$ is small.

Example 53. Let $x \in \mathbb{R}^2$ and $N(x)$ be the normal distribution on $\mathbb{R}^2$ whose
average is zero and whose covariance matrix is the $2 \times 2$ identity matrix,

$$ N(x) = \frac{1}{2\pi}\exp\left(-\frac{1}{2}\|x\|^2\right). $$

We study a statistical model

$$ p(x|a,b,c) = a\,N(x-b) + (1-a)\,N(x-c), $$

where $0 \le a \le 1$ and $b, c \in \mathbb{R}^2$ are parameters. For a prior of $a$, we adopt a
Dirichlet distribution,

$$ \varphi_1(a) \propto (a(1-a))^{\alpha-1}, $$

where $\alpha > 0$ is a hyperparameter. For a prior of $(b,c)$,

$$ \varphi_2(b,c) \propto \exp\left(-\frac{\|b\|^2 + \|c\|^2}{2B^2}\right), $$

where $B$ is a hyperparameter. Several cases are studied experimentally.

(1) Regular and realizable case.

$$ q(x) = p(x|a_0, b_0, c_0), $$

where $a_0 = 0.5$, $b_0 = (2,2)$, and $c_0 = (-2,-2)$.

(2) Regular and unrealizable case.

$$ q(x) = a_0 N((x-b_0)/\sigma)/\sigma^2 + (1-a_0) N((x-c_0)/\sigma)/\sigma^2, $$

where $a_0 = 0.5$, $\sigma = 0.8$, $b_0 = (2,2)$, and $c_0 = (-2,-2)$.

(3) Nonregular and realizable case.

$$ q(x) = p(x|a_0, b_0, c_0), $$

where $a_0 = 0$, $b_0 = (0,0)$, and $c_0 = (0,0)$.

(4) Nonregular and unrealizable case.

$$ q(x) = a_0 N((x-b_0)/\sigma)/\sigma^2 + (1-a_0) N((x-c_0)/\sigma)/\sigma^2, $$

where $a_0 = 0$, $\sigma = 0.8$, $b_0 = (0,0)$, and $c_0 = (0,0)$.
(5) Delicate case.

$$ q(x) = a_0 N((x-b_0)/\sigma)/\sigma^2 + (1-a_0) N((x-c_0)/\sigma)/\sigma^2, $$

where $a_0 = 0.5$, $\sigma = 0.95$, $b_0 = (0.5, 0.5)$, and $c_0 = (-0.5, -0.5)$.

(6) Unbalanced case.

$$ q(x) = a_0 N(x-b_0) + (1-a_0) N(x-c_0), $$

where $a_0 = 0.01$, $b_0 = (2,2)$, and $c_0 = (-2,-2)$.

Table 8.1: Experimental results in Example 53. In the table, averages and
standard deviations of the normalized values $G_n - S$, $\mathrm{ISCV} - S_n$, $\mathrm{AIC}_b - S_n$,
$\mathrm{DIC} - S_n$, and $\mathrm{WAIC} - S_n$ are displayed.

Case                        |     | G_n - S | ISCV - S_n | AIC_b - S_n | DIC - S_n | WAIC - S_n
(1) Regular, realizable     | Ave | 0.0254  | 0.0254     | 0.0251      | 0.0247    | 0.0254
(2) Regular, unrealizable   | Ave | 0.1089  | 0.1043     | 0.1184      | 0.1110    | 0.1043
(3) Nonregular, realizable  | Ave | 0.0129  | 0.0160     | 0.0418      | 0.0034    | 0.0158
                            | Std | 0.0085  | 0.0088     | 0.0118      | 0.0283    | 0.0088
(4) Nonregular, unrealizable| Ave | 0.0983  | 0.1036     | 0.1409      | 0.1049    | 0.1036
                            | Std | 0.0067  | 0.0399     | 0.0412      | 0.0455    | 0.0399
(5) Delicate                | Ave | 0.0384  | 0.0384     | 0.0479      | 0.0001    | 0.0387
                            | Std | 0.0175  | 0.0239     | 0.0232      | 0.0537    | 0.0241
(6) Unbalanced              | Ave | 0.0276  | 0.0255     | 0.0343      | -0.1618   | 0.0225
                            | Std | 0.0169  | 0.0267     | 0.0156      | 0.3568    | 0.0235
In each case, the entropy and the empirical entropy of the true distribution
are defined by

$$ S = -\int q(x)\log q(x)\,dx, $$

$$ S_n = -\frac{1}{n}\sum_{i=1}^{n}\log q(X_i). $$

For the case $n = 100$, the posterior distributions were built by the Gibbs
sampler, in which the burn-in was 200 and the number of posterior parameters
was 1000. Hyperparameters were set as $\alpha = 0.5$ and $B = 10$. We
conducted 100 independent experiments for each condition. In Table 8.1,
‘Ave’ and ‘Std’ show the averages and standard deviations of $G_n - S$,
$\mathrm{ISCV} - S_n$, $\mathrm{AIC}_b - S_n$, $\mathrm{DIC} - S_n$, and $\mathrm{WAIC} - S_n$.

(1) If a true distribution was realizable by and regular for a statistical model,
then all information criteria estimated the generalization loss well.

(2) If a true distribution was unrealizable by and regular for a statistical
model, then $\mathrm{AIC}_b$ overestimated the generalization loss.

(3) through (5) In nonregular and delicate cases, ISCV and WAIC were more
accurate than $\mathrm{AIC}_b$ and DIC.

(6) In an unbalanced case, ISCV, WAIC, and $\mathrm{AIC}_b$ were more accurate than
DIC. Note that only a few sample points were generated from the first component.
As an unbiased estimator, ISCV was better than WAIC and $\mathrm{AIC}_b$; however,
the variance of ISCV was larger than those of WAIC and $\mathrm{AIC}_b$. $\mathrm{AIC}_b$ had the
smallest variance. Their intervals $[m - 2\sigma, m + 2\sigma]$, where $m$ and $\sigma$ are the
averages and standard deviations, were

  G_n - S      : [-0.0062, 0.0614]
  ISCV - S_n   : [-0.0279, 0.0789]
  AIC_b - S_n  : [0.0031, 0.0655]
  DIC - S_n    : [-0.8754, 0.5518]
  WAIC - S_n   : [-0.0245, 0.0695]

Therefore, not only the cross validation loss but also information criteria
contain important information.
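The intervals quoted for the unbalanced case follow directly from row (6) of Table 8.1. The short check below recomputes m - 2σ and m + 2σ from the tabulated averages and standard deviations; the dictionary keys are labels chosen here for readability.

```python
# Averages (m) and standard deviations (sigma) from row (6) of Table 8.1.
rows = {
    "G_n - S":     (0.0276, 0.0169),
    "ISCV - S_n":  (0.0255, 0.0267),
    "AIC_b - S_n": (0.0343, 0.0156),
    "DIC - S_n":   (-0.1618, 0.3568),
    "WAIC - S_n":  (0.0225, 0.0235),
}

# Interval [m - 2*sigma, m + 2*sigma], rounded to four decimals as in the text.
intervals = {
    name: (round(m - 2 * s, 4), round(m + 2 * s, 4))
    for name, (m, s) in rows.items()
}

for name, (lo, hi) in intervals.items():
    print(f"{name:12s}: [{lo:.4f}, {hi:.4f}]")
```

The DIC interval is roughly ten times wider than the others, which is the numerical content of the statement that DIC was the least accurate criterion in this case.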


Example 54. Examples of model selections are shown in sections 2.4 and 2.5. 
If a statistical model has hierarchical structure or hidden variables, then the 
posterior distribution cannot be approximated by any normal distribution 
in general, hence we can apply the cross validation and WAIC, but not 
AIC or DIC. If we need a neural network with many hidden units or a 
normal mixture with many components, the MCMC process sometimes fails 
because of local minima. If such models have a few redundant hidden parts, 
the MCMC rather easily attains the posterior distribution. If Bayesian 
estimation is applied to such statistical models, the generalization losses do 
not increase much, hence we recommend a model which has a few redundant 
parts. 


8.1.2. Comparison of ISCV with WAIC 


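In typical experiments, ISCV is almost equal to WAIC. First, we show that
if a sample consists of independent random variables, then ISCV and WAIC
are asymptotically equivalent as random variables. Let $\mathcal{T}_n(\alpha)$ be the function
defined in eq.(3.11) in Definition 8.


Theorem 24. Assume that $X_1, X_2, \ldots, X_n$ are independent and that

$$ \sup_{|\alpha| \le 1}\left|\left(\frac{d}{d\alpha}\right)^{4}\mathcal{T}_n(\alpha)\right| = O_p\!\left(\frac{1}{n^2}\right). $$

Then the following equation holds,

$$ \mathrm{ISCV} = \mathrm{WAIC} + O_p\!\left(\frac{1}{n^2}\right). $$


Proof. ISCV is defined by

$$ \mathrm{ISCV} = \frac{1}{n}\sum_{i=1}^{n}\log E_w[1/p(X_i|w)]. $$

By the definition of $\mathcal{T}_n(\alpha)$ in eq.(3.11),

$$ \mathrm{ISCV} = \mathcal{T}_n(-1). $$

Since $\mathcal{T}_n(0) = 0$, by using the mean value theorem, there exists $|\beta^*| < 1$ such that

$$ \mathcal{T}_n(-1) = -\mathcal{T}_n'(0) + \frac{1}{2}\mathcal{T}_n''(0) - \frac{1}{6}\mathcal{T}_n^{(3)}(0) + \frac{1}{24}\mathcal{T}_n^{(4)}(\beta^*). $$

On the other hand, WAIC is defined by

$$ \mathrm{WAIC} = T_n + V_n = -\mathcal{T}_n(1) + \mathcal{T}_n''(0). $$

By using the mean value theorem, there exists $|\beta^{**}| < 1$ such that

$$ T_n = -\mathcal{T}_n'(0) - \frac{1}{2}\mathcal{T}_n''(0) - \frac{1}{6}\mathcal{T}_n^{(3)}(0) - \frac{1}{24}\mathcal{T}_n^{(4)}(\beta^{**}). $$

Hence

$$ \mathrm{WAIC} = -\mathcal{T}_n'(0) + \frac{1}{2}\mathcal{T}_n''(0) - \frac{1}{6}\mathcal{T}_n^{(3)}(0) - \frac{1}{24}\mathcal{T}_n^{(4)}(\beta^{**}), \tag{8.41} $$

which completes the proof. □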


Figure 8.1: The horizontal and vertical axes show log(radius) and log(mass),
respectively, of the bodies in the solar system. The circle corresponds to the
sun, which is the leverage sample point. The regression line estimated
by including the sun is given by the dotted line, whereas the regression line
estimated without it is shown by the solid line.


Remark 64. (1) This theorem holds even if the posterior distribution cannot
be approximated by any normal distribution. By the proof, it is also derived
that

$$ \mathrm{ISCV} = \mathrm{WAIC} + \frac{1}{12}\mathcal{T}_n^{(4)}(0) + o_p(n^{-2}). $$

By Theorem 26, if the posterior distribution can be approximated by some
normal distribution, then $\mathcal{T}_n^{(4)}(0) = o_p(n^{-2})$, so that the difference
between ISCV and WAIC is smaller than $O_p(n^{-2})$.

(2) Assume that there exist constants $g_1$ and $g_2$ which satisfy

$$ E[G_n] = g_1 + \frac{g_2}{n} + o(1/n). $$

By the definition,

$$ E[\mathrm{CV}] = E[G_{n-1}]. $$

Hence

$$ E[\mathrm{CV}] - E[G_n] = E[G_{n-1}] - E[G_n] = O(1/n^2). $$

By the above theorem,

$$ E[\mathrm{WAIC}] - E[G_n] = O(1/n^2). $$

Therefore, CV and WAIC are asymptotically equivalent as estimators of
the generalization loss, if a sample consists of independent random variables.
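The step from the expansion of $E[G_n]$ to the $O(1/n^2)$ difference can be written out explicitly. The following is a sketch which assumes that the remainder term of the expansion is itself of order $O(1/n^2)$, a slightly stronger condition than the $o(1/n)$ stated in the remark:

```latex
E[G_{n-1}] - E[G_n]
  = g_2\left(\frac{1}{n-1} - \frac{1}{n}\right) + O(1/n^2)
  = \frac{g_2}{n(n-1)} + O(1/n^2)
  = O(1/n^2).
```

The constant term $g_1$ cancels, and the difference of the two $1/n$ terms is already of second order, which is why leaving one sample point out changes the average generalization loss only at order $1/n^2$.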


Remark 65. (Comparison of ISCV and WAIC) In numerical experiments,
the difference between ISCV and WAIC is very small in many cases; however,
sometimes they are different. First, if a sample $\{(X_i, Y_i)\}$ is dependent,
then the averages of CV and ISCV are different from that of the
generalization loss. On the other hand, the average of WAIC is asymptotically
equal to that of the generalization loss if a sample consists of conditionally
independent variables. Second, in statistical estimation of the conditional
probability $q(y|x)$, if $n$ is not large enough to ensure

$$ q(x) \approx \frac{1}{n}\sum_{i=1}^{n}\delta(x - X_i), $$

then ISCV is different from WAIC. Third, if a sample contains a leverage
sample point, then the variance of ISCV diverges, whereas that of WAIC does not.
The third case can be understood as a special case of the second one. If a
leverage sample point is contained in a sample, then the data analyst should
reconsider whether such a point should be included in the sample. A leverage
sample point can be found by the following procedure. If ISCV is not equal
to WAIC, then for every sample point $X_i$, the partial functional variance

$$ V_w[\log p(Y_i|X_i, w)] $$

is calculated. If it is much larger than the others, then $X_i$ is a leverage sample
point.
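The procedure for finding a leverage sample point can be sketched numerically. Everything in the following example is an assumption made for illustration: the synthetic data (19 ordinary inputs plus one distant input, with a deterministic alternating sequence standing in for noise), the regression model Y = aX + b + N(0, 1) with known unit noise variance, and the flat prior on (a, b), under which the posterior is the normal distribution N(beta_hat, (A^T A)^{-1}) and can be sampled directly without MCMC.

```python
import numpy as np

# Synthetic data (assumed for illustration): 19 ordinary inputs and one
# distant input, which plays the role of the leverage sample point.
x = np.append(np.linspace(-1.0, 1.0, 19), 10.0)
noise = 0.5 * (-1.0) ** np.arange(x.size)   # deterministic stand-in for noise
y = 1.0 * x + 0.5 + noise

# Model Y = aX + b + N(0, 1) with a flat prior on (a, b):
# the posterior is N(beta_hat, (A^T A)^{-1}).
A = np.column_stack([x, np.ones_like(x)])
cov = np.linalg.inv(A.T @ A)
beta_hat = cov @ A.T @ y
rng = np.random.default_rng(1)
w = rng.multivariate_normal(beta_hat, cov, size=5000)   # draws of (a, b)

# loglik[s, i] = log p(Y_i | X_i, w_s)
resid = y[None, :] - w @ A.T
loglik = -0.5 * np.log(2 * np.pi) - 0.5 * resid ** 2

# Partial functional variance V_w[log p(Y_i | X_i, w)] for each sample point.
partial_var = np.var(loglik, axis=0)
leverage_index = int(np.argmax(partial_var))
print(leverage_index, partial_var[leverage_index], np.max(partial_var[:-1]))
```

In this construction the distant input has a partial functional variance far larger than that of any other point, so inspecting `partial_var` singles it out. With real data the threshold for "much larger than the others" remains a judgment for the data analyst.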


Remark 66. The importance sampling cross validation loss diverges if a
leverage sample point is contained [57] [20]. Recently, a new method for
numerical approximation of the cross validation was devised in which the
posterior distribution is replaced by the Pareto distribution [76]. This method
gives an approximation of the cross validation loss. WAIC is not an
approximation of the cross validation loss but is an estimator of the
generalization loss. If a sample is dependent, then the cross validation loss is
not an estimator of the generalization loss.


Example 55. (Leverage sample point) Let $\{X_i\}$ be the values of log(radius) of
the celestial bodies in the solar system (Mercury, Venus, Earth, ...), and let
$\{Y_i\}$ be the corresponding values of log(mass). If
we study a simple regression problem, $Y = aX + b + \text{noise}$, then the datum
of the sun is a leverage sample point. In Figure 8.1, the circle shows the
datum of the sun. The regression line without the sun is shown by the solid
line, whereas the regression line with the sun is shown by the dotted line. The
$(X, Y)$ of the sun cannot be estimated from the other data, hence the cross
validation fails. Even in such a case, the information criteria AIC and WAIC can
be used to estimate the statistical estimation error.


Example 56. (Classification problem) Let us study a classification problem
$q(z|x,y)$ using a neural network, where $(x,y) \in \mathbb{R}^2$ and the true output $z$ is
set by the function

$$ z = \begin{cases} 1 & (y > \sin(\pi x/2)) \\ 0 & (\text{otherwise}). \end{cases} $$

A sample of 50 points is shown in Figure 8.2. The solid and dotted lines show
the estimated and true classification boundaries, respectively. The letters ‘o’
and ‘*’ are sample points classified as one and zero by the true rule,
respectively. A three-layered neural network which has input units $M = 2$, hidden
units $H = 5$, and an output unit $N = 1$ was employed to learn the
classification rule. The posterior distribution is approximated by the Metropolis
method explained in the previous chapter. In the classification problem,
sample points near the boundary strongly affect the statistical inference:
in fact, the classification result for such a point cannot be estimated from the
other sample points. In Figure 8.2, the points which are displayed with
numbers are leverage sample points. The partial functional variance of the
$i$th sample point

$$ V_i = V_w[\log p(X_i|w)] $$

shows the strength of the sample point’s effect. Figure 8.3 shows $\{V_i\}$
for each $i$. A larger $V_i$ shows that the $i$th sample point exerts more of
an effect on the result. In Figure 8.3, samples 2, 4, 28, 30, 33, and 45 are
leverage samples.

Figure 8.2: Classification problem. Two categories in two-dimensional space
are classified by a neural network. The letters ‘o’ and ‘*’ show sample points
classified as one and zero. The solid and dotted lines show the estimated
and true boundaries. The sample points near the boundary are leverage
sample points.


Practical Advice. If one has a posterior parameter set generated by an
MCMC method, then it is easy to numerically calculate ISCV, AIC, DIC,
and WAIC. Hence the author recommends that all of them be calculated.

(1) If they are all equal, then any of them can be employed.

(2) If a statistical model has a hierarchical structure, such as a normal
mixture or a neural network, and if $\mathrm{ISCV} = \mathrm{WAIC} \gg \mathrm{DIC}$, then the posterior
distribution is not localized. In this case, DIC is not appropriate.

(3) If $\mathrm{ISCV} \neq \mathrm{WAIC}$, then there may exist a leverage sample point. The data
analyst should reconsider whether such a leverage sample point should
be included or not. In conditionally independent problems such as time series
analysis, ISCV does not correspond to the generalization loss whereas WAIC
does.

Figure 8.3: Functional variance for each sample point. The horizontal axis
shows the index of the sample point in Figure 8.2. The vertical axis shows
the partial functional variance $V_i$ of each sample point. The leverage sample
points have large partial functional variances.


8.1.3 Criteria for Free Energy 


In this subsection, we study the model selection problem by the free energy,
that is, the minus log marginal likelihood. If we know the true distribution $q(x)$
and the real log canonical threshold and its multiplicity $(\lambda, m)$, then the
asymptotic free energy is given by

$$ F_n = nL_n(w_0) + \lambda\log n - (m-1)\log\log n + O_p(1), $$

where $w_0$ is the optimal parameter. However, since $w_0$, $\lambda$, and $m$ depend on the
true distribution $q(x)$, this asymptotic expansion cannot be used directly for
estimating $F_n$.

Definition 28. The free energy that is numerically calculated by eq.(7.19)
is denoted by F(MCMC),

$$ F(\mathrm{MCMC}) = -\sum_{k}\log E_w^{(\beta_k)}\big[\exp(-(\beta_{k+1} - \beta_k)\,nL_n(w))\big]. $$

The free energy calculated by using the regular assumption and eq.(4.57) is
denoted by F(REG),

$$ F(\mathrm{REG}) = -\sum_{i=1}^{n}\log p(X_i|\hat{w}) + \frac{d}{2}\log n + \frac{1}{2}\log\det J(\hat{w}) - \log\varphi(\hat{w}) - \frac{d}{2}\log(2\pi), \tag{8.42} $$

where $\hat{w}$ is the maximum likelihood estimator. The Schwarz BIC is obtained
by removing the constant order terms from F(REG),

$$ \mathrm{BIC} = -\sum_{i=1}^{n}\log p(X_i|\hat{w}) + \frac{d}{2}\log n. $$

The widely applicable Bayesian information criterion WBIC is

$$ \mathrm{WBIC} = -E_w^{(1/\log n)}\Big[\sum_{i=1}^{n}\log p(X_i|w)\Big], $$

where $E_w^{(\beta)}[\ ]$ is the posterior average using the inverse temperature
$\beta = 1/\log n$,

$$ E_w^{(\beta)}[f(w)] = \frac{\displaystyle\int f(w)\prod_{i=1}^{n} p(X_i|w)^{\beta}\,\varphi(w)\,dw}{\displaystyle\int \prod_{i=1}^{n} p(X_i|w)^{\beta}\,\varphi(w)\,dw}. $$

Comparison of F(MCMC), F(REG), BIC, and WBIC. If a true distribution is
regular for a statistical model, then all criteria can be employed. In such a
case,

$$ F(\mathrm{MCMC}) = F_n + \varepsilon, $$
$$ F(\mathrm{REG}) = F_n + o_p(1), $$
$$ \mathrm{BIC} = F_n + O_p(1), $$
$$ \mathrm{WBIC} = F_n + O_p(1), $$
$$ \mathrm{WBIC} = \mathrm{BIC} + o_p(1), $$

where $\varepsilon$ is an error which depends on the numerical calculation. If a true
distribution is not regular for a statistical model, then

$$ F(\mathrm{MCMC}) = F_n + \varepsilon, $$
$$ \mathrm{WBIC} = F_n + O_p((\log n)^{1/2}), \tag{8.43} $$

where the second equation is proved in [85]. The mathematical structure is
explained in Theorem 25. Note that in Theorem 25, the function $H(w)$ has
no random fluctuation. In the case when $H(w)$ is a stochastic process,
a refined proof shows eq.(8.43).
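The relation between WBIC, BIC, and the free energy in a regular case can be checked on a model where everything is available in closed form. The sketch below rests on assumptions chosen for illustration only: the model p(x|w) = N(x | w, 1), the prior N(0, 10^2), a synthetic sample, and the fact that for this conjugate pair the posterior at inverse temperature β = 1/log n is again normal, so it is sampled directly instead of by MCMC. The variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, prior_sd = 1000, 10.0
x = rng.normal(0.5, 1.0, size=n)            # synthetic sample
sum_x, sum_x2 = x.sum(), np.sum(x ** 2)

def neg_log_lik(w):
    """n L_n(w) = -sum_i log N(x_i | w, 1); works for scalars and arrays."""
    return 0.5 * n * np.log(2 * np.pi) + 0.5 * (sum_x2 - 2 * w * sum_x + n * w ** 2)

# Posterior at inverse temperature beta = 1/log n (normal for this model).
beta = 1.0 / np.log(n)
precision = beta * n + 1.0 / prior_sd ** 2
mean = beta * sum_x / precision
w_draws = rng.normal(mean, 1.0 / np.sqrt(precision), size=10000)

# WBIC: average of n L_n(w) over the tempered posterior.
wbic = float(np.mean(neg_log_lik(w_draws)))

# BIC at the maximum likelihood estimator (d = 1).
bic = float(neg_log_lik(x.mean()) + 0.5 * np.log(n))

# Exact free energy: minus log marginal likelihood in closed form.
s2 = prior_sd ** 2
free_energy = float(0.5 * n * np.log(2 * np.pi) + 0.5 * np.log(1 + n * s2)
                    + 0.5 * (sum_x2 - s2 * sum_x ** 2 / (1 + n * s2)))

print(wbic, bic, free_energy)
```

Here WBIC and BIC agree up to Monte Carlo error, and both differ from the exact free energy by a constant of order one, which in this toy model comes mainly from the term -log φ(ŵ) of the prior.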


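Theorem 25. Assume that $H(w)$ is an analytic function of $w$ and $\varphi(w)$
is a $C^\infty$ class function. Let $F_1$ and $F_2$ be

$$ F_1 = -\log\int \exp(-nH(w))\,\varphi(w)\,dw, $$

$$ F_2 = \frac{\displaystyle\int nH(w)\exp(-(n/\log n)H(w))\,\varphi(w)\,dw}{\displaystyle\int \exp(-(n/\log n)H(w))\,\varphi(w)\,dw}. $$

Then, even if the Hessian matrix $\nabla^2 H(w_0)$ at a minimum point $w_0$ is
singular,

$$ F_1 - F_2 = o(\log n). $$


Proof. If the minimum value of $H(w)$ is $H_0$, and $H_1(w) = H(w) - H_0$, then

$$ F_1 = nH_0 - \log\int \exp(-nH_1(w))\,\varphi(w)\,dw, $$

$$ F_2 = nH_0 + \frac{\displaystyle\int nH_1(w)\exp(-(n/\log n)H_1(w))\,\varphi(w)\,dw}{\displaystyle\int \exp(-(n/\log n)H_1(w))\,\varphi(w)\,dw}. $$

Hence we can assume $H_0 = 0$ without loss of generality. The zeta function
of $H(w)$ is defined by

$$ \zeta(z) = \int H(w)^z\,\varphi(w)\,dw \quad (z \in \mathbb{C}). $$

Then we can derive
(1) In the region $\mathrm{Re}(z) > 0$, $\zeta(z)$ is an analytic function of the complex
variable $z$.
(2) $\zeta(z)$ can be analytically continued to a unique meromorphic function
whose poles $(-\lambda_1) > (-\lambda_2) > \cdots$ are all real and negative values. We define
$m_k$ as the order of the pole $(-\lambda_k)$. Then $\zeta(z)$ has the Laurent expansion

$$ \zeta(z) = \sum_{k=1}^{\infty}\sum_{m=1}^{m_k}\frac{C_{km}}{(z+\lambda_k)^m}, $$

where $C_{km} \in \mathbb{C}$. The state density and partition functions are respectively
defined by

$$ v(t) = \int \delta(t - H(w))\,\varphi(w)\,dw \quad (0 < t < 1), $$

$$ Z(n) = \int \exp(-nH(w))\,\varphi(w)\,dw \quad (n > 0). $$

Then it follows that

$$ \zeta(z) = \int_0^1 t^z\,v(t)\,dt, $$

$$ Z(n) = \int_0^1 \exp(-nt)\,v(t)\,dt. $$

In other words, $\zeta(z)$ and $Z(n)$ are the Mellin and Laplace transforms of
$v(t)$, respectively. The following equation can be derived by mathematical
induction on $m = 1, 2, \ldots$,

$$ \frac{1}{(z+\lambda)^m} = \frac{1}{(m-1)!}\int_0^1 (-\log t)^{m-1}\,t^{z+\lambda-1}\,dt. $$

By using this equation and the Laurent expansion of $\zeta(z)$, we obtain the
asymptotic expansion,

$$ v(t) \cong \sum_{k=1}^{\infty}\sum_{m=1}^{m_k}\frac{C_{km}}{(m-1)!}\,t^{\lambda_k-1}(-\log t)^{m-1}. $$

Therefore, the asymptotic expansion of $Z(n)$ holds,

$$ Z(n) \cong \sum_{k=1}^{\infty}\sum_{m=1}^{m_k}\frac{C_{km}}{(m-1)!}\int_0^1 t^{\lambda_k-1}(-\log t)^{m-1}\exp(-nt)\,dt $$

$$ \phantom{Z(n)} = \sum_{k=1}^{\infty}\sum_{m=1}^{m_k}\frac{C_{km}}{(m-1)!}\int_0^n (t/n)^{\lambda_k-1}(-\log(t/n))^{m-1}\exp(-t)\,\frac{dt}{n}. $$

In the case $n \to \infty$, the largest order term is

$$ Z(n) \cong \frac{C_{1m_1}\Gamma(\lambda_1)}{(m_1-1)!}\,\frac{(\log n)^{m_1-1}}{n^{\lambda_1}}. $$

Hence

$$ F_1 = \lambda_1\log n - (m_1-1)\log\log n + \mathrm{Const.} + \cdots $$

On the other hand, by using $\beta = n/\log n$, we define

$$ F_{21} = \int_0^1 \exp(-\beta t)\,nt\,v(t)\,dt, $$

$$ F_{22} = \int_0^1 \exp(-\beta t)\,v(t)\,dt. $$

Then $F_2 = F_{21}/F_{22}$. By using the same method as for $F_1$,

$$ F_{21} \cong \frac{C_{1m_1}\Gamma(\lambda_1+1)}{(m_1-1)!}\,\frac{n(\log\beta)^{m_1-1}}{\beta^{\lambda_1+1}}, $$

$$ F_{22} \cong \frac{C_{1m_1}\Gamma(\lambda_1)}{(m_1-1)!}\,\frac{(\log\beta)^{m_1-1}}{\beta^{\lambda_1}}. $$

Then by $\Gamma(\lambda_1+1) = \lambda_1\Gamma(\lambda_1)$, it follows that

$$ F_2 \cong \lambda_1\log n, $$

which completes the proof. □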
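The statement of Theorem 25 can be illustrated numerically in a one-dimensional singular example. The choices below are assumptions made for this sketch: H(w) = w^4 on W = [-1, 1] with the uniform prior φ(w) = 1/2, for which the zeta function is 1/(4z + 1), so that λ_1 = 1/4 and m_1 = 1. The integrals are approximated by a simple trapezoidal rule written out by hand, and the function names are hypothetical.

```python
import math
import numpy as np

def trapezoid(f, w):
    """Trapezoidal rule for samples f on the grid w."""
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(w)))

def f1_f2(n, num=200001):
    """F_1 and F_2 of Theorem 25 for H(w) = w^4 and a uniform prior on [-1, 1]."""
    w = np.linspace(-1.0, 1.0, num)
    prior = np.full_like(w, 0.5)
    H = w ** 4                      # the Hessian at the minimum w0 = 0 is zero
    f1 = -math.log(trapezoid(np.exp(-n * H) * prior, w))
    temp = n / math.log(n)          # the quantity called beta in the proof
    weight = np.exp(-temp * H) * prior
    f2 = trapezoid(n * H * weight, w) / trapezoid(weight, w)
    return f1, f2

results = {n: f1_f2(n) for n in (10**2, 10**4, 10**6)}

# For this H, the leading terms give F_2 = (1/4) log n and
# F_1 - F_2 = -log(Gamma(1/4) / 4), a constant that does not grow with n.
expected_gap = -math.log(math.gamma(0.25) / 4.0)

for n, (f1, f2) in results.items():
    print(n, f1, f2, 0.25 * math.log(n), f1 - f2)
```

In this example both F_1 and F_2 grow like (1/4) log n although the Hessian at the minimum vanishes, and their difference stays bounded, which is stronger than the o(log n) guaranteed by the theorem because m_1 = 1 removes the log log n term.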


Example 57. A simple model selection experiment using WBIC was conducted.
Let $x \in \mathbb{R}^2$, $y \in \mathbb{R}$. The input $X_i$ was generated from the uniform
distribution on $[-2,2]^2$. The true distribution of $Y_i$ was set as $p(y|x, w_0)$,
where $p(y|x, w_0)$ was made by a neural network defined by eq.(2.27) with
three hidden units, $H = 3$. From this true distribution, $n = 500$ sample
points were generated. The prior was set as the normal distribution $N(0, 10^2)$
for each parameter of the network. The 1000 posterior parameters were approximated
by a Metropolis method with burn-in 1000 and sampling interval 200.
Figure 8.4 shows WBICs for neural networks with $H = 1, 2, 3, 4, 5$. As shown in
the figure, the true model could be chosen by WBIC.

Figure 8.4: WBIC of neural networks. The horizontal axis shows the number
of hidden units in a neural network, and the vertical axis shows WBIC. By using
WBIC, model selection according to the asymptotic free energy can be
realized.


Remark 67. In general, the asymptotic form of the free energy, or the
minus log marginal likelihood, depends on a true distribution, since the real
log canonical threshold depends on the true distribution. Recently, a new
method was devised by which both the true distribution and the free energy
can be estimated simultaneously using the real log canonical threshold [19].


8.1.4 Discussion for Model Selection 


Let us study model selection problems from two different points of view. 
This discussion is based on Professor Akaike’s argument. 


Artificial case. Assume that a true distribution is realizable by a statistical
model which is contained in the finite set of candidate models. Since such
a case is rare in the real world, it is called an artificial case. In the artificial
case, the minimal model by which a true distribution is realizable is called
the true model. A model selection algorithm is called consistent if the
probability that the selected model is equal to the true model converges to
one as $n \to \infty$. In general, model selection algorithms which employ the
cross validation loss, AIC, DIC, and WAIC are not consistent. The reason
why they are inconsistent is that the random fluctuation caused by a sample
is of the same order as the difference of the generalization losses. On the other
hand, the model selection algorithm which is based on the free energy is
consistent, because the main order term, of order $\log n$, is not a random variable
but a constant which is larger than the random fluctuation. Therefore, in an
artificial case, the free energy is better than the generalization loss.


Natural case. Assume that we have candidate models, but a true
distribution is not realizable by any statistical model whose parameter has finite
dimension. Almost all statistical problems in the real world are classified
into this case, hence it is called the natural case. In a natural case, if the
number of random variables increases, then the best model also becomes
more complex, so consistency has no meaning. A model selection algorithm
is called efficient if the average generalization loss of the selected model is
minimized among the candidate models. From the viewpoint of efficiency,
the generalization loss is better than the free energy.


When we compare several model selection methods by computer simulation,
we often set a true distribution as in an artificial case. However, such an
experiment may be different from the natural cases.


8.2 Hyperparameter Optimization 


A parameter of a prior distribution is called a hyperparameter. In this 
section, we study several problems in hyperparameter optimization. 


Remark 68. If a set of a statistical model $p(x|w)$ and a prior $\varphi(w|\theta)$ is
prepared, one might think that the hyperparameter $\theta$ could be automatically
optimized by introducing a hyperprior distribution $\varphi_1(\theta)$. However, this is
not true. If a hyperprior distribution is employed, then it strongly affects
the optimal hyperparameter, so that the chosen hyperparameter is
not optimized but determined by the choice of the hyperprior. In such a case
the marginal $\int \varphi(w|\theta)\varphi_1(\theta)\,d\theta$ should be evaluated as a prior. Hence we need a
method to evaluate $(p(x|w), \varphi(w))$. For example, in nonparametric
Bayesian estimation, the Dirichlet hyperparameter $\alpha$ might be determined
by using a hyperprior, but it is not automatically the optimal one. Even
in nonparametric cases, the cross validation and WAIC can be employed to
evaluate the hyperparameter.


In this section, we study the general case

$$ \varphi(w) > 0, \quad \text{but } \int \varphi(w)\,dw \text{ may be infinite.} $$

Even in such cases, we can use the same definition of the posterior
distribution as in the case $\int \varphi(w)\,dw = 1$,

$$ p(w|X^n) = \frac{\varphi(w)\prod_{i=1}^{n} p(X_i|w)}{\displaystyle\int \varphi(w)\prod_{i=1}^{n} p(X_i|w)\,dw}, $$

because this definition does not require any normalizing condition on $\varphi(w)$.
Hence the definitions of the generalization, cross validation, and training
losses are also invariant. However, the free energy should be redefined by

$$ F_n = -\log\int \prod_{i=1}^{n} p(X_i|w)\,\varphi(w)\,dw + \log\int \varphi(w)\,dw, $$

because, without the second term on the right hand side, $F_n$ is an unbounded
function of $\varphi$.


Remark 69. (1) In the hyperparameter optimization problem for the
generalization loss, we admit cases when $\int \varphi(w)\,dw = \infty$. For example,
$\varphi(w) = 1$ on an unbounded parameter set $W$ can be used in this section.
On the other hand, for the free energy, $\int \varphi(w)\,dw < \infty$ is necessary for a finite
$F_n$. That is to say, the generalization loss and the free energy have an
essential difference in preparing the set of priors.

(2) Let $\hat{w}$ be the maximum likelihood estimator. Then, for an arbitrary $w$,

$$ \prod_{i=1}^{n} p(X_i|w) \le \prod_{i=1}^{n} p(X_i|\hat{w}). $$

Hence if there exists a sequence of priors $\{\varphi_k(w)\}$ such that
$\varphi_k(w) \to \delta(w - \hat{w})$, then the infimum value of $F_n$ is attained by such a
sequence, which converges to the maximum likelihood method. Therefore, when the
free energy is applied to the prior optimization, the set of candidate priors should
be set so that such a sequence is not contained.


Example 58. The Dirichlet distribution
$$\varphi(a) \propto a^{\alpha-1}(1-a)^{\beta-1}$$
converges to $\delta(a - a_0)$ by
$$\alpha_k = k a_0, \qquad \beta_k = k(1 - a_0),$$
and $k \to \infty$. Hence $\alpha$ and $\beta$ should be bounded by some constant.
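The concentration described in Example 58 can be checked numerically. The following is a minimal sketch, assuming the density is the two-component Dirichlet (Beta) density proportional to $a^{\alpha-1}(1-a)^{\beta-1}$; the function name `beta_mean_var` and the value `a0 = 0.3` are hypothetical choices made only for this illustration.

```python
def beta_mean_var(alpha, beta):
    """Mean and variance of the density proportional to a^(alpha-1) (1-a)^(beta-1)."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1.0))
    return mean, var


a0 = 0.3  # hypothetical concentration point
for k in (1, 10, 100, 1000):
    # alpha_k = k * a0 and beta_k = k * (1 - a0), as in Example 58
    mean, var = beta_mean_var(k * a0, k * (1.0 - a0))
    print(f"k={k:5d}  mean={mean:.3f}  variance={var:.2e}")
```

The mean stays at `a0` while the variance decays like $1/(k+1)$, which is why an unbounded hyperparameter set lets the prior collapse onto a point.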
8.2.1 Criteria for Generalization Loss


In the model selection problem according to the generalization loss, we stud- 
ied the cross validation, AIC, TIC, DIC, and WAIC. Since the effect of the 
prior choice to the generalization loss is weaker than that of the statistical 
model, we need a precise tool to observe the difference of the small order. In 
fact, neither AIC, TIC, nor DIC can be applied to prior evaluation. In this 
subsection, we study the hyperparameter optimization by the cross valida- 
tion and WAIC. If a true distribution is regular for a statistical model, then 
we have the following theorem even if a true distribution is not realizable 
by a statistical model. 


Regular case. In regular cases, the effect of hyperparameter optimization
by cross validation and WAIC is mathematically clarified. Let
$\varphi_0(w)$ and $\varphi(w)$ be arbitrary fixed and candidate priors, respectively. As a
typical case, $\varphi_0(w) = 1$ for all $w \in W$ can be chosen. The empirical log loss
function and the maximum a posteriori (MAP) estimator $\hat{w}$ using $\varphi_0(w)$ are
respectively defined by
$$L_n(w) = -\frac{1}{n}\sum_{i=1}^n \log p(X_i|w) - \frac{1}{n}\log \varphi_0(w), \tag{8.44}$$
$$\hat{w} = \arg\min_w L_n(w), \tag{8.45}$$
where neither $L_n(w)$ nor $\hat{w}$ depends on the candidate prior $\varphi(w)$. If
$\varphi_0(w) = 1$, then $\hat{w}$ is equal to the maximum likelihood estimator (MLE). The
average log loss function and the parameter that minimizes it are respectively
defined by
$$L(w) = -\int q(x)\log p(x|w)\,dx, \tag{8.46}$$
$$w_0 = \arg\min_w L(w). \tag{8.47}$$


In this section, we use the following notation for simplicity.

(1) A parameter is denoted by $w = (w^1, w^2, \ldots, w^k, \ldots, w^d) \in \mathbb{R}^d$.

(2) For an arbitrary function $f(w)$ and nonnegative integers $k_1, k_2, \ldots, k_m$,
we define
$$(f)_{k_1 k_2 \cdots k_m}(w) = \frac{\partial^m f}{\partial w^{k_1}\,\partial w^{k_2}\cdots\partial w^{k_m}}(w). \tag{8.48}$$

(3) We adopt Einstein's summation convention, and $k_1, k_2, k_3, \ldots$ are used for
such suffixes. For example,
$$X^{k_1 k_2} Y_{k_2 k_3} = \sum_{k_2=1}^{d} X^{k_1 k_2} Y_{k_2 k_3}.$$
In other words, if a suffix $k_i$ appears both as an upper and as a lower index, it means automatic
summation over $k_i = 1, 2, \ldots, d$. In this section, for each $k_1, k_2$,
$X^{k_1 k_2} = X^{k_1}_{k_2} = X_{k_1 k_2}$.


Definition. (Empirical mathematical relations between priors) For fixed
and candidate priors $\varphi_0(w)$ and $\varphi(w)$, the prior ratio function is defined by
$$\phi(w) = \varphi(w)/\varphi_0(w).$$
The empirical mathematical relation between two priors at a parameter $w$
is defined by
$$M(\phi, w) = A^{k_1 k_2}(\log\phi)_{k_1}(\log\phi)_{k_2} + B^{k_1 k_2}(\log\phi)_{k_1 k_2} + C^{k_1}(\log\phi)_{k_1}, \tag{8.49}$$
where
$$J^{k_1 k_2}(w) = \text{Inverse matrix of } (L_n)_{k_1 k_2}(w), \tag{8.50}$$
$$A^{k_1 k_2}(w) = \frac{1}{2} J^{k_1 k_2}(w), \tag{8.51}$$
$$B^{k_1 k_2}(w) = \frac{1}{2}\big(J^{k_1 k_2}(w) + J^{k_1 k_3}(w) J^{k_2 k_4}(w) F_{k_3, k_4}(w)\big), \tag{8.52}$$
$$C^{k_1}(w) = J^{k_1 k_2}(w) J^{k_3 k_4}(w) F_{k_2 k_4, k_3}(w) - \frac{1}{2} J^{k_1 k_2}(w) J^{k_3 k_4}(w) (L_n)_{k_2 k_3 k_4}(w) - \frac{1}{2} J^{k_1 k_2}(w) J^{k_3 k_4}(w) J^{k_5 k_6}(w) (L_n)_{k_2 k_3 k_5}(w) F_{k_4, k_6}(w), \tag{8.53}$$
and
$$F_{k_1, k_2}(w) = \frac{1}{n}\sum_{i=1}^n (\log p(X_i|w))_{k_1} (\log p(X_i|w))_{k_2}, \tag{8.54}$$
$$F_{k_1 k_2, k_3}(w) = \frac{1}{n}\sum_{i=1}^n (\log p(X_i|w))_{k_1 k_2} (\log p(X_i|w))_{k_3}. \tag{8.55}$$

Remark. Note that none of $A^{k_1 k_2}(w)$, $B^{k_1 k_2}(w)$, and $C^{k_1}(w)$ depends on the
candidate prior $\varphi(w)$. Therefore, as a function of the candidate prior,
$M(\phi, w)$ is determined by $\log\phi$ only.


Definition. (Average mathematical relations of priors) The average mathematical
relation $\mathcal{M}(\phi, w)$ is defined in the same manner as eq.(8.49) by the
replacements
$$J^{k_1 k_2}(w) \mapsto \text{Inverse matrix of } E[(L_n)_{k_1 k_2}(w)], \tag{8.56}$$
$$(L_n)_{k_1 k_2}(w) \mapsto E[(L_n)_{k_1 k_2}(w)], \tag{8.57}$$
$$(L_n)_{k_1 k_2 k_3}(w) \mapsto E[(L_n)_{k_1 k_2 k_3}(w)], \tag{8.58}$$
$$F_{k_1, k_2}(w) \mapsto E[F_{k_1, k_2}(w)], \tag{8.59}$$
$$F_{k_1 k_2, k_3}(w) \mapsto E[F_{k_1 k_2, k_3}(w)]. \tag{8.60}$$

The following theorem shows the effect of the choice of the candidate prior
in comparison with the fixed prior.


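Theorem 26. Let $\varphi_0(w)$ and $\varphi(w)$ be fixed and candidate priors, respectively.
The prior ratio function is defined by
$$\phi(w) = \varphi(w)/\varphi_0(w).$$
Let $M(\phi, w)$ and $\mathcal{M}(\phi, w)$ be the empirical and average mathematical relations
between $\varphi(w)$ and $\varphi_0(w)$. As random variables,
$$\mathrm{CV}(\varphi) = \mathrm{CV}(\varphi_0) + \frac{M(\phi, \hat{w})}{n^2} + O_p\Big(\frac{1}{n^3}\Big), \tag{8.61}$$
$$\mathrm{WAIC}(\varphi) = \mathrm{WAIC}(\varphi_0) + \frac{M(\phi, \hat{w})}{n^2} + O_p\Big(\frac{1}{n^3}\Big), \tag{8.62}$$
$$\mathrm{CV}(\varphi) = \mathrm{WAIC}(\varphi) + O_p\Big(\frac{1}{n^3}\Big). \tag{8.63}$$
Their expected values satisfy
$$E[\mathrm{CV}(\varphi)] = E[\mathrm{CV}(\varphi_0)] + \frac{\mathcal{M}(\phi, w_0)}{n^2} + O\Big(\frac{1}{n^3}\Big), \tag{8.64}$$
$$E[\mathrm{WAIC}(\varphi)] = E[\mathrm{WAIC}(\varphi_0)] + \frac{\mathcal{M}(\phi, w_0)}{n^2} + O\Big(\frac{1}{n^3}\Big), \tag{8.65}$$
where
$$M(\phi, \hat{w}) = \mathcal{M}(\phi, w_0) + O_p\Big(\frac{1}{n^{1/2}}\Big), \tag{8.66}$$
$$E[M(\phi, \hat{w})] = \mathcal{M}(\phi, w_0) + O\Big(\frac{1}{n}\Big). \tag{8.67}$$
On the other hand, the generalization loss satisfies
$$G(\varphi) = G(\varphi_0) + \frac{1}{n}\big(\hat{w}^{k_1} - (w_0)^{k_1}\big)(\log\phi)_{k_1}(\hat{w}) + O_p\Big(\frac{1}{n^2}\Big) = G(\varphi_0) + O_p\Big(\frac{1}{n^{3/2}}\Big), \tag{8.68}$$
$$E[G(\varphi)] = E[G(\varphi_0)] + \frac{\mathcal{M}(\phi, w_0)}{n^2} + O\Big(\frac{1}{n^3}\Big). \tag{8.69}$$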

For the proof of this theorem, see [86]. If a candidate prior has a
hyperparameter $\theta$, which is written as $\varphi(w) = \varphi(w|\theta)$, the following facts are
derived from this theorem.

(1) The hyperparameters that minimize $E[\mathrm{CV}]$, $E[\mathrm{WAIC}]$, and $E[G_n]$ are
asymptotically equal to each other.

(2) The hyperparameters that minimize CV, WAIC, and $E[G_n]$ are asymptotically
equal to each other. Hence by minimizing CV or WAIC, we can
find the optimal hyperparameter that minimizes $E[G_n]$ asymptotically.

(3) [Important point]. The hyperparameters that minimize the random
variable $G_n$ and the average $E[G_n]$ are not equal to each other, even asymptotically.
In general they are far from each other, and one does not converge
to the other even if $n$ tends to infinity. By minimizing CV or WAIC, we can
find the optimal hyperparameter that minimizes $E[G_n]$, but we cannot find
the optimal hyperparameter that minimizes $G_n$.

Hence, when the hyperparameter is determined by minimizing the cross
validation loss or WAIC, $E[G_n]$ is asymptotically minimized but $G_n$ is not (see
Example 59). It is strongly conjectured that there is no observable which
can estimate the random variable $G_n$ for an arbitrary true distribution, because
we do not know the true distribution. This is the conjecture about
the limit of statistical estimation.
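As an illustration of how candidate priors can be compared by WAIC and the cross validation loss, the following is a minimal sketch that evaluates both from a matrix of pointwise log likelihoods. It assumes the usual definitions (WAIC as the training loss plus the functional variance divided by $n$, and the importance-sampling form of the leave-one-out cross validation loss). The toy model $N(w,1)$ with a hypothetical prior $N(0, 1/\alpha)$, and the names `waic`, `is_cv`, and `log_mean_exp`, are assumptions made only for this example.

```python
import numpy as np


def log_mean_exp(a, axis=0):
    """Numerically stable log of the mean of exp(a) along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(a - m), axis=axis))


def waic(loglik):
    """WAIC = training loss + functional variance / n; loglik has shape (draws, n)."""
    train = -np.mean(log_mean_exp(loglik, axis=0))
    func_var = np.sum(np.var(loglik, axis=0))
    return train + func_var / loglik.shape[1]


def is_cv(loglik):
    """Importance-sampling cross validation loss: mean_i log E_w[1 / p(X_i|w)]."""
    return np.mean(log_mean_exp(-loglik, axis=0))


rng = np.random.default_rng(0)
n = 50
x = rng.normal(0.5, 1.0, size=n)                  # hypothetical data
for alpha in (0.1, 1.0, 10.0):                    # candidate prior precisions
    post_mean = x.sum() / (n + alpha)             # exact posterior of the toy model
    post_sd = (n + alpha) ** -0.5
    w = rng.normal(post_mean, post_sd, size=4000) # posterior draws
    loglik = -0.5 * np.log(2 * np.pi) - 0.5 * (x[None, :] - w[:, None]) ** 2
    print(alpha, waic(loglik), is_cv(loglik))
```

In this toy setting the two criteria nearly coincide for each candidate, in line with eq.(8.63); the differences between candidates are small, which is why a precise criterion is needed.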


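Remark 70. Since $E[\mathrm{CV}(\varphi_0)]$ of $X^n$ is equal to $E[G(\varphi_0)]$ of $X^{n-1}$ and
$$\frac{1}{(n-1)^2} = \frac{1}{n^2} + O\Big(\frac{1}{n^3}\Big),$$
it immediately follows from Theorem 26 that
$$E[G(\varphi)] = E[G(\varphi_0)] + \frac{\mathcal{M}(\phi, w_0)}{n^2} + O\Big(\frac{1}{n^3}\Big), \tag{8.70}$$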
BICV(y)] = EIG(yo)] + Pt) 5 5), (gr) 
B[WAIC(y)] = E[G(yo)] +2 two) 5) (g.72) 
Nonregular case. In nonregular cases, the determination of the hyperparameter
is an essential part of Bayesian inference. However, mathematical
analysis of this case is still difficult, because nonregular statistical models
undergo phase transitions as the hyperparameter is varied. For estimating
the averages, cross validation and WAIC can be employed. See
Example 67.


8.2.2 Criterion for Free Energy 


For the purpose of minimizing the free energy, neither BIC nor
WBIC can be applied, because they do not estimate the constant order
term. The values F(REG) and F(MCMC) can be used in regular and all
cases, respectively. If F(REG) is employed, then the hyperparameter that
minimizes F(REG) is equal to the one that maximizes $\varphi(\hat{w})$. In other
words, the hyperparameter is optimized so that the prior at the maximum
likelihood estimator is maximized.

For hyperparameter optimization, there is another method. Let
$F_n(\alpha)$ be the free energy for a prior $\varphi(w|\alpha)$, where $\alpha$ is a hyperparameter.
Then
$$\frac{\partial F_n}{\partial\alpha} = -\frac{\int \frac{\partial\varphi}{\partial\alpha}(w|\alpha)\prod_{i=1}^n p(X_i|w)\,dw}{\int \varphi(w|\alpha)\prod_{i=1}^n p(X_i|w)\,dw} = -E_w\Big[\frac{\partial}{\partial\alpha}\log\varphi(w|\alpha)\Big].$$
Hence, by tabulating where this derivative is positive and negative (an increase
and decrease table), the hyperparameter that minimizes $F_n(\alpha)$ can be found.
In order to calculate F(MCMC), we need
posterior distributions for many inverse temperatures, whereas $\partial F_n/\partial\alpha$ can
be calculated by one MCMC process.
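The identity $\partial F_n/\partial\alpha = -E_w[\partial_\alpha \log\varphi(w|\alpha)]$ can be checked on a model whose free energy is available in closed form. The sketch below assumes a toy model $X_i \sim N(w, 1)$ with the normalized prior $N(0, 1/\alpha)$, and it replaces the MCMC run by exact posterior draws; the function names are hypothetical and introduced only for this illustration.

```python
import numpy as np


def free_energy(alpha, x):
    """Exact F_n(alpha) for x_i ~ N(w, 1) with the normalized prior N(0, 1/alpha)."""
    n, s = len(x), np.sum(x)
    return (0.5 * n * np.log(2 * np.pi) + 0.5 * np.sum(x ** 2)
            - 0.5 * np.log(alpha) + 0.5 * np.log(n + alpha)
            - s ** 2 / (2.0 * (n + alpha)))


def dF_dalpha_mc(alpha, x, rng, draws=200_000):
    """Monte Carlo estimate of dF/dalpha = -E_w[d/dalpha log phi(w|alpha)]."""
    n, s = len(x), np.sum(x)
    # Exact posterior draws stand in for the output of a single MCMC run.
    w = rng.normal(s / (n + alpha), (n + alpha) ** -0.5, size=draws)
    dlogprior = 0.5 / alpha - 0.5 * w ** 2   # d/dalpha of log N(w | 0, 1/alpha)
    return -np.mean(dlogprior)


rng = np.random.default_rng(0)
x = rng.normal(0.5, 1.0, size=50)            # hypothetical data
h = 1e-5
for alpha in (0.5, 2.0, 8.0):
    finite_diff = (free_energy(alpha + h, x) - free_energy(alpha - h, x)) / (2 * h)
    print(alpha, dF_dalpha_mc(alpha, x, rng), finite_diff)
```

The sign of the estimated derivative along a grid of $\alpha$ values indicates where $F_n(\alpha)$ decreases and increases, so the minimizer can be located without computing the free energy itself.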


Example 59. By using a statistical model which enables us to calculate
the generalization and cross validation losses and the free energy exactly, let
us study the hyperparameter optimization problem numerically. We use the
normal distribution and its conjugate prior defined by eqs.(2.1) and (2.2).
Let $n = 200$. The hyperparameter $(\phi_1, \phi_2, \phi_3)$ in the region
$$0 < \phi_1 < 10$$
is examined, where $\phi_2 = 0$ and $\phi_3 = 1$. We conducted 100 independent
experiments. In Figure 8.5, the horizontal axis shows the value $\phi_1$.


[Figure 8.5: four panels, including Cross Validation and Free Energy, each plotted against the hyperparameter.]


Figure 8.5: Hyperparameter optimization. The horizontal axis in every panel
shows the value of the hyperparameter. The vertical axes show the average
generalization error, the generalization error, the cross validation error, and
the free energy. The minimum point of the free energy is not equal to that
of the average generalization error. The minimum point of the generalization
error has a very large variance, and thus it is not equal to that of the average
generalization error. The minimum point of the cross validation error is
asymptotically equal to that of the average generalization error.

(1) Upper left: The average generalization error for a given hyperparameter.

(2) Upper right: Generalization errors for a given hyperparameter.

(3) Lower left: Cross validation losses for a given hyperparameter.

(4) Lower right: Free energies for a given hyperparameter.

In (2), (3), and (4), each function of the hyperparameter is shifted so that
its minimum value is equal to zero. The hyperparameter that
minimizes the average generalization loss is almost equal to $\phi_1 = 5$. The
hyperparameter that minimizes each generalization loss strongly depends on
the sample $X^n$. It almost always lies outside $0 < \phi_1 < 10$. Note that
it sometimes lies in $\phi_1 < 0$. The hyperparameter that minimizes the cross
validation loss is in the neighborhood of $\phi_1 = 5$. The hyperparameter that
minimizes the free energy is in the neighborhood of $\phi_1 = 2$. These results
show the case $n = 200$. If $n$ is smaller, the variance of the chosen hyperparameter
is larger, hence too much optimization of the hyperparameter may
be dangerous. Note that if all of $(\phi_1, \phi_2, \phi_3)$ are optimized simultaneously,
then the hyperparameter diverges.


8.2.3 Discussion for Hyperparameter Optimization 


Regular case. Assume a true distribution is regular for a statistical model, 
then: 

(1) If a hyperparameter is optimized by minimization of the cross validation 
or WAIC, then the average generalization loss is minimized asymptotically. 
However, the generalization loss itself is not minimized. Moreover the ran- 
dom fluctuation of the optimized hyperparameter is not small. 

(2) If a hyperparameter is optimized by minimizing the free energy, then 
it is asymptotically equivalent to maximizing the value of the prior at the 
maximum likelihood estimator. The random fluctuation of the optimized 
hyperparameter may be smaller than the cross validation, however, it does 
not minimize the generalization loss. 

Therefore, even if the prior is optimized, its effect on the accuracy of prediction
is small. Moreover, the random fluctuation may make the variance of the
optimized hyperparameter larger. Therefore, too much optimization is not necessary.
However, choosing an appropriate prior among several candidates by
cross validation or WAIC may be useful.

Singular case. Assume that a true distribution is singular for a statistical
model. Then both the generalization loss and the free energy undergo
phase transitions as the hyperparameter is varied (for the definition of the
phase transition, see the following chapter). If the real log canonical threshold
is minimized by an appropriate choice of the hyperparameter, then
the generalization loss becomes smaller. However, in singular cases, the effect of
controlling the hyperparameter on precise prediction has not yet been sufficiently clarified
mathematically. This is an important problem for future study.

Example 60. (LASSO) Let us study LASSO (least absolute shrinkage and
selection operator). Let $x \in \mathbb{R}^M$, $y \in \mathbb{R}^N$. A model and a prior are defined
by
$$p(y|x,w) \propto \exp\Big(-\frac{1}{2\sigma^2}\|y - wx\|^2\Big),$$
$$\varphi(w|\xi) \propto \exp\Big(-\xi\sum_{j,k}|w_{jk}|\Big),$$
where $w = \{w_{jk}\}$ is an $N \times M$ matrix, $\sigma$ is a constant, and $\xi$ is a hyperparameter.
The purpose of using this prior is to make the estimated parameter
sparse. In fact, the maximum a posteriori (MAP) estimator using this
prior becomes sparse if $\xi$ is chosen appropriately. Let us study the Bayesian
case. Assume that the true distribution is $p(y|x,w_0)$. Then the Bayesian
generalization error is
$$E[G_n] = S + \frac{\lambda}{n} + o(1/n),$$
where $S$ is the entropy of $p(y|x,w_0)$ and $\lambda$ is the real log canonical threshold.
The value $(-\lambda)$ is equal to the maximum pole of the zeta function,
$$\zeta(z) = \int \|w - w_0\|^{2z}\exp\Big(-\xi\sum_{j,k}|w_{jk}|\Big)\,dw.$$
Even if almost all elements of $w_0$ are equal to zero, the largest pole of the zeta
function is equal to $-d/2$, where $d = MN$ is the dimension of the parameter,
since $\varphi(w_0|\xi) > 0$. In other words, $\lambda = d/2$ does not depend on the choice
of $\xi$. Therefore, in Bayesian estimation, the prior $\exp(-\xi\sum_{j,k}|w_{jk}|)$ is not
appropriate for a sparse representation of the parameter. In LASSO, Bayesian
estimation is very different from MAP estimation.


Example 61. (Bayesian LASSO) Let $x \in \mathbb{R}^M$, $y \in \mathbb{R}$, $w \in \mathbb{R}^M$. We study a
statistical model
$$p(y|x,w) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{1}{2\sigma^2}(y - w\cdot x)^2\Big),$$
where $\sigma > 0$ is not a parameter but a constant, and a prior
$$\varphi(w) = C(a,\varepsilon)\prod_{j=1}^M \frac{1}{|w_j|^a}\exp\big(-\varepsilon (w_j)^2\big),$$
where $C(a,\varepsilon)$ is a constant
$$C(a,\varepsilon) = \Big(\frac{\varepsilon^{(1-a)/2}}{\Gamma((1-a)/2)}\Big)^M$$
for a given hyperparameter $a < 1$ and $\varepsilon > 0$. Note that as $a$ becomes
large, the posterior distribution concentrates on the neighborhood of the origin.
Assume that the true distribution is $p(y|x,w_0)$, where the number of
the nonzero elements of $w_0$ is equal to $M_0$. The true distribution $q(x)$ is
the direct product of the standard normal distribution. Then the real log
canonical threshold is
$$\lambda(a) = \frac{1}{2}\{M_0 + (1-a)(M - M_0)\}, \tag{8.73}$$
so that the asymptotic generalization error is
$$E[G_n] - S = \frac{\lambda(a)}{n} + o(1/n).$$
For a given sample $(X^n, Y^n)$, the posterior distribution is
$$p(w|X^n,Y^n) \propto \prod_{j=1}^M \frac{1}{|w_j|^a}\exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - w\cdot X_i)^2 - \varepsilon\sum_{j=1}^M (w_j)^2\Big).$$
By using the formula
$$\int_0^\infty u^{a/2-1}\exp(-w^2 u)\,du = \frac{\Gamma(a/2)}{|w|^a},$$
the posterior distribution can be represented by
$$p(w|X^n,Y^n) \propto \Big(\prod_{j=1}^M \int du_j\,(u_j)^{a/2-1}\Big)\exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - w\cdot X_i)^2 - \sum_{j=1}^M (w_j)^2 u_j - \varepsilon\sum_{j=1}^M (w_j)^2\Big).$$
Hence a Gibbs sampler for $(w,u)$ can be constructed. In fact $p(w|u)$ is the
normal distribution whose average is
$$m = S^{-1}\Big(\frac{1}{\sigma^2}\sum_{i=1}^n Y_i X_i\Big)$$
and whose covariance matrix is $S^{-1}$. Here
$$S = \frac{1}{\sigma^2}\sum_{i=1}^n X_i (X_i)^T + 2\,\mathrm{Diag}(\varepsilon + u_1, \varepsilon + u_2, \ldots, \varepsilon + u_M),$$
where $\mathrm{Diag}(u_1, u_2, \ldots, u_M)$ is the diagonal matrix whose diagonal coefficients
are $(u_1, u_2, \ldots, u_M)$. On the other hand, $p(u|w)$ is the direct product of the
gamma distributions $G(u_j|a/2, 1/(w_j)^2)$, where
$$G(u|a,b) = \frac{1}{b^a\,\Gamma(a)}\,u^{a-1}\exp\Big(-\frac{u}{b}\Big). \tag{8.74}$$
Figure 8.6 shows an experimental result for the case $M = 40$, $M_0 = 10$,
and $n = 200$. In order to make the posterior distribution stable, if $|u_j| >
U_{\max} = 1000000$, then it is replaced by $U_{\max}$. The horizontal axis shows the
hyperparameter $a$, and the generalization error, the cross validation error,
and the WAIC error are compared with the theoretical value. In this case the
true parameter is sparse, hence the generalization error is made smaller by
using Bayesian LASSO.

Figure 8.6: Generalization error by Bayesian LASSO. The horizontal axis
shows the hyperparameter $a$. The generalization error, the cross validation
error, and the WAIC error are compared with the theoretical value. By using
Bayesian LASSO, the generalization error can be made smaller if the true
parameter is sparse.
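The Gibbs sampler of Example 61 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for Figure 8.6: the function name `bayesian_lasso_gibbs`, the default values of `a`, `eps`, and `sigma`, the iteration counts, and the synthetic data are all hypothetical. The sampler alternates the normal conditional of $w$ given $u$ with the gamma conditional of $u$ given $w$, and truncates $u_j$ at `u_max` as described in the text; it requires $0 < a < 1$ so that the gamma shape $a/2$ is positive.

```python
import numpy as np


def bayesian_lasso_gibbs(X, Y, a=0.5, eps=0.01, sigma=1.0,
                         n_iter=2000, burn_in=500, u_max=1e6, seed=0):
    """Gibbs sampler alternating w | u (normal) and u | w (gamma)."""
    rng = np.random.default_rng(seed)
    n, M = X.shape
    XtX = X.T @ X / sigma ** 2
    XtY = X.T @ Y / sigma ** 2
    u = np.ones(M)
    samples = []
    for it in range(n_iter):
        # w | u ~ N(S^{-1} XtY, S^{-1}) with S = XtX + 2 Diag(eps + u)
        S = XtX + 2.0 * np.diag(eps + u)
        L = np.linalg.cholesky(S)                      # S = L L^T
        mean = np.linalg.solve(S, XtY)
        # L^{-T} z has covariance (L L^T)^{-1} = S^{-1}
        w = mean + np.linalg.solve(L.T, rng.standard_normal(M))
        # u_j | w ~ Gamma(shape a/2, scale 1 / w_j^2), truncated at u_max
        u = np.minimum(rng.gamma(a / 2.0, 1.0 / w ** 2), u_max)
        if it >= burn_in:
            samples.append(w)
    return np.array(samples)


rng = np.random.default_rng(1)
n, M, M0 = 200, 40, 10
w_true = np.zeros(M)
w_true[:M0] = 1.0                                      # hypothetical sparse truth
X = rng.standard_normal((n, M))
Y = X @ w_true + rng.standard_normal(n)
draws = bayesian_lasso_gibbs(X, Y)
print(draws.mean(axis=0).round(2))
```

The posterior draws of $w$ can then be used to evaluate the cross validation loss or WAIC for each candidate value of $a$.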
Figure 8.7: Generalization errors by the minimum cross validation. The
solid line shows the generalization error by the model selection with respect 
to the minimum cross validation loss. The dotted line shows that of the 
unselected larger model. In this experiment, WAIC and CV resulted in 
the same model selection. Note that the model selection does not always 
minimize the generalization error. In fact, in the delicate case when two 
models are almost balanced, the generalization error becomes larger. 
8.3 Problems


1. Let $U_n$ be a random variable which is defined by
$$U_n = \frac{1}{n}\sum_{i=1}^n \log E_w[p(X_i|w)] - \frac{2}{n}\sum_{i=1}^n E_w[\log p(X_i|w)].$$
Then prove that $U_n$ has the same second order asymptotic expansion as
WAIC.


2. Let $x, a \in \mathbb{R}^d$. A statistical model and a prior are defined by
$$p(x|a) = \frac{1}{(2\pi)^{d/2}}\exp\Big(-\frac{1}{2}\|x - a\|^2\Big),$$
$$\varphi(a) = \frac{1}{(2\pi)^{d/2}}\exp\Big(-\frac{1}{2}\|a\|^2\Big).$$
Let a true distribution be $p(x|a_0)$. We study a model selection between
statistical models $p(x|a)$ and $p(x|0)$. That is to say, the predictive density
$\hat{p}(x)$ by the minimum cross validation loss is defined by
$$\hat{p}(x) = \begin{cases} E_w[p(x|w)] & (\text{if } C^{(1)} < C^{(0)}) \\ p(x|0) & (\text{otherwise}), \end{cases}$$
where $C^{(1)}$ is the cross validation loss of $p(x|a)$ and
$$C^{(0)} = -\frac{1}{n}\sum_{i=1}^n \log p(X_i|0).$$
The generalization error of the minimum cross validation loss is defined by
$$\int p(x|a_0)\log\frac{p(x|a_0)}{\hat{p}(x)}\,dx.$$
Then this is a function of the distance between $0$ and $a_0$. The solid line
in Figure 8.7 shows its behavior as such a function in the case $d = 5$. The
horizontal axis shows the distance between the origin and the true parameter.
The dotted line shows the generalization error of the predictive density of
$p(x|a)$ without model selection. Discuss the effect of the model selection on
the generalization error.


3. If the posterior distribution can be approximated by some normal distribution,
then the difference between BIC and the free energy is a constant
order term. Otherwise, the difference between WBIC and the free energy
is at most a $\log\log n$ order term. Discuss how much such differences affect
the selected models.


4. Assume that the posterior distribution can be approximated by some 
normal distribution. Then the hyperparameter that minimizes the gener- 
alization loss does not converge to the hyperparameter that minimizes the 
average generalization loss. On the other hand, the hyperparameter that 
minimizes the cross validation loss or WAIC converges to one which min- 
imizes the average generalization loss. Discuss the best procedure that a 
statistician can follow to find the minimum generalization error. 


5. A neural network which has a deep hierarchical structure has many pa- 
rameters and the posterior distribution can seldom be approximated by any 
normal distribution. For such statistical models, Bayesian estimation makes 
the generalization loss very small. However, it is difficult to approximate 
the posterior distribution by MCMC. Discuss the best procedure that a 
statistician can follow for deep learning. 


6. Prove eq.(8.73). 


Chapter 9


Topics in Bayesian Statistics 


In this chapter we study the mathematical foundations of several topics in Bayesian
statistics.

(1) The formal optimality of the Bayesian estimation is explained.

(2) A method to construct the Bayesian hypothesis test is explained.

(3) The Bayesian model comparison method, which is different from the
Bayesian hypothesis test, is examined.

(4) The concept of the phase transition of the posterior distribution is introduced.

(5) In a statistical model which has singularities in the parameter space,
if the sample size $n$ is small, the posterior distribution is singular, whereas
if $n$ becomes large, it becomes regular. This phenomenon is a kind of phase
transition called the discovery process.

(6) In hierarchical Bayesian estimation, we find several different kinds of
predictions. There are different cross validation losses and information criteria
according to the different predictive losses.


9.1 Formal Optimality 


If we know the true prior and the true model, then Bayesian inference is 
optimal. In this section we confirm this fact. 

Assume that $\Phi(w)$ is the true prior density of the parameter $w$ and that
$P(x|w)$ is the true conditional density of $x$. In this book, we assume that the
true distribution is unknown; in this section, however, we study the special
case in which a random parameter $W$ is generated from $\Phi(w)$, and then a sample
$X^n = (X_1, X_2, \ldots, X_n)$ is independently generated from $P(x|w)$,
$$W \sim \Phi(w),$$
$$X_1, X_2, \ldots, X_n \sim P(x|w).$$
The simultaneous probability density function of $(W, X^n)$ is equal to
$$P(w, x^n) = \Phi(w)\prod_{i=1}^n P(x_i|w).$$
Therefore the conditional probability density of $w$ for a given sample $x^n$ is
equal to
$$P(w|x^n) = \frac{\Phi(w)\prod_{i=1}^n P(x_i|w)}{\int dw'\,\Phi(w')\prod_{i=1}^n P(x_i|w')}. \tag{9.1}$$
This is equal to the Bayesian posterior distribution for the case when $\Phi(w)$
and $P(x|w)$ are chosen as a prior and a statistical model. The probability
distribution of a new $x$ is given by
$$P(x|x^n) = \frac{\int dw\,\Phi(w)P(x|w)\prod_{i=1}^n P(x_i|w)}{\int dw'\,\Phi(w')\prod_{i=1}^n P(x_i|w')}. \tag{9.2}$$
This is also equal to the Bayesian predictive distribution for the case when
$\Phi(w)$ and $P(x|w)$ are chosen as a prior and a statistical model. Let us prove
that this prediction minimizes the average Kullback-Leibler distance under
the circumstance $P(w, x^n)$.

Let $f(x|X^n)$ be an arbitrary conditional density function of $x$ for $X^n$.
For a given $(w, x^n)$, the Kullback-Leibler distance from $P(x|w)$ to $f(x|x^n)$
is equal to
$$G(f|w,x^n) = \int dx\,P(x|w)\log\frac{P(x|w)}{f(x|x^n)}.$$
Let us define the average functional loss of $f$ by
$$G(f) = \int dw\,\Phi(w)\prod_{i=1}^n \int P(x_i|w)\,dx_i\;G(f|w,x^n).$$
Then the function $f$ that minimizes $G(f)$ gives the optimal inference under
the circumstance $P(w, x^n)$. The following theorem shows that such a
function is the Bayesian predictive distribution.


Theorem 27. The average functional loss $G(f)$ is minimized if and only if
$f(x|X^n) = P(x|X^n)$.

Proof. The function $G(f)$ is minimized if and only if
$$G_0(f) = -\int dw\,\Phi(w)\prod_{i=1}^n \int P(x_i|w)\,dx_i \int dx\,P(x|w)\log f(x|x^n) = -\prod_{i=1}^n \int dx_i \int dx\,P(x,x^n)\log f(x|x^n)$$
is minimized, where $P(x,x^n)$ is the simultaneous density of $(x,x^n)$. Let
$P(x^n)$ be the marginal distribution of $x^n$ defined by the denominator of
eq.(9.1). Then
$$G_0(f) = -\prod_{i=1}^n \int dx_i\,P(x^n)\int dx\,P(x|x^n)\log f(x|x^n)$$
$$= \prod_{i=1}^n \int dx_i\,P(x^n)\int dx\,P(x|x^n)\log\frac{P(x|x^n)}{f(x|x^n)} - \prod_{i=1}^n \int dx_i\,P(x^n)\int dx\,P(x|x^n)\log P(x|x^n).$$
The first term on the right-hand side is the average Kullback-Leibler distance
from $P(x|x^n)$ to $f(x|x^n)$, and the second term is a constant function of
$f(x|x^n)$. Therefore $G_0(f)$ is minimized if and only if $P(x|x^n) = f(x|x^n)$. □


Remark 71. (1) By this theorem, if we knew the true prior and the true 
statistical model, there is no statistical inference that attains the smaller 
average loss than Bayesian predictive inference using the true prior and the 
true statistical model. 

(2) In the real world, we do not know the true distribution. It may seem 
that the formal optimality theorem does not give any methodology to the 
real world. However, by the theorem, we can mathematically conclude that 
the question for finding the optimal statistical inference without information 
about a prior and a statistical model does not have any answer. In other 
words, the nature of the statistical estimation is ill-defined, therefore we 
need the evaluation process of a statistical model and a prior. 
Example 62. Theorems of the same kind can be proved.
(1) Let $f(x^n)$ be a function from $x^n$ to the parameter space. The error
function
$$E(f) = \int dw \int dx^n\,P(w,x^n)\,\|w - f(x^n)\|^2,$$
where $\|\cdot\|$ is the norm on the parameter space, is minimized if and only if
$$f(x^n) = \frac{\int dw\,w\,P(w,x^n)}{\int dw\,P(w,x^n)},$$
which is the posterior average of the parameter using the true prior and the
true statistical model. If we knew the true prior and the true statistical
model, there would be no other function which makes the square error smaller. In
practical problems, we do not know the true prior and the true statistical
model, thus determining the optimal prior and model is an ill-defined
problem.
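The statement of Example 62 can be illustrated by simulation. The sketch below assumes a hypothetical setting in which the true prior is $\Phi(w) = N(0,1)$ and the true model is $P(x|w) = N(w,1)$, so that the posterior average is $\sum_i x_i/(n+1)$; it compares this estimator with the sample mean, which ignores the prior. The sample size and number of trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000
w = rng.standard_normal(trials)                      # W ~ Phi(w) = N(0, 1)
x = w[:, None] + rng.standard_normal((trials, n))    # X_i ~ P(x|w) = N(w, 1)

post_mean = x.sum(axis=1) / (n + 1)                  # posterior average under the true prior
sample_mean = x.mean(axis=1)                         # estimator that ignores the prior

mse_post = np.mean((w - post_mean) ** 2)
mse_mean = np.mean((w - sample_mean) ** 2)
print(mse_post, mse_mean)                            # theory: 1/(n+1) and 1/n
```

The advantage of the posterior average holds only on average over parameters drawn from the true prior, which is exactly the circumstance $P(w, x^n)$ assumed in this section.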


9.2 Bayesian Hypothesis Test 


Assume that $X_i$ is an $\mathbb{R}^N$-valued random variable. If a parameter $w$ is
subject to a prior $\varphi(w)$ and if $X_1, X_2, \ldots, X_n$ are independently subject to a
probability density $p(x|w)$, then such a condition is denoted by
$$w \sim \varphi(w),$$
$$X_1, X_2, \ldots, X_n \sim p(x|w).$$
In this section, we study a Bayesian hypothesis test of a null hypothesis
(N.H.) versus an alternative one (A.H.),
$$\text{N.H.}: \; w_0 \sim \varphi_0(w_0), \quad X_i \sim p_0(x|w_0),$$
$$\text{A.H.}: \; w_1 \sim \varphi_1(w_1), \quad X_i \sim p_1(x|w_1).$$
Note that $w_0$ and $w_1$ may be contained in different sets, for example, $w_0 \in
\mathbb{R}^{d_0}$ and $w_1 \in \mathbb{R}^{d_1}$.

Let us study the hypothesis test for the above hypotheses. In order to
make a hypothesis test, we need two probabilities for both hypotheses. Let
an event $\Theta$ be a subset of $\mathbb{R}^{Nn}$. In other words,
$$\Theta \subset \{x^n = (x_1, x_2, \ldots, x_n) \; ; \; x_i \in \mathbb{R}^N\}.$$
Let $\mathrm{Prob}(\Theta|\text{N.H.})$ and $\mathrm{Prob}(\Theta|\text{A.H.})$ be conditional probabilities of a set
$\Theta$ where the conditions are defined by the null and alternative hypotheses,
respectively. The conditional probability density function of $X^n =
(X_1, X_2, \ldots, X_n)$ under the condition that $w \sim \varphi(w)$, $X_i \sim p(x|w)$ is given
by
$$P(x^n|\varphi, p) = \int \varphi(w)\prod_{i=1}^n p(x_i|w)\,dw. \tag{9.3}$$
Then
$$\mathrm{Prob}(\Theta|\text{N.H.}) = \int_\Theta P(x^n|\varphi_0, p_0)\,dx^n, \tag{9.4}$$
$$\mathrm{Prob}(\Theta|\text{A.H.}) = \int_\Theta P(x^n|\varphi_1, p_1)\,dx^n. \tag{9.5}$$

A hypothesis test is defined by an arbitrary pair $(T(x^n), t)$, where $T(x^n)$ is
a real-valued function of $x^n$ and $t$ is a real value. Once a hypothesis test is
fixed, the decision for a given $x^n$ is determined by
$$\text{If } T(x^n) \le t \;\Longrightarrow\; \text{Null hypothesis is chosen},$$
$$\text{Else} \;\Longrightarrow\; \text{Alternative hypothesis is chosen}.$$
Any pair $(T(x^n), t)$ gives a test, but we want a better or the best one, hence we
need a method to evaluate a given test $(T(x^n), t)$. The level and power
of a test are respectively defined by
$$\mathrm{Level}(T,t) = \mathrm{Prob}(\text{A.H. is chosen} \,|\, \text{N.H.}), \tag{9.6}$$
$$\mathrm{Power}(T,t) = \mathrm{Prob}(\text{A.H. is chosen} \,|\, \text{A.H.}), \tag{9.7}$$
which are respectively equal to
$$\mathrm{Level}(T,t) = \mathrm{Prob}(T(X^n) > t \,|\, \text{N.H.}), \tag{9.8}$$
$$\mathrm{Power}(T,t) = \mathrm{Prob}(T(X^n) > t \,|\, \text{A.H.}). \tag{9.9}$$
That is to say, the level is the probability that A.H. is chosen when N.H. is
true, whereas the power is the probability that A.H. is chosen when A.H. is
true. A hypothesis test which has a smaller level and a higher power gives
a better procedure for decision.

In practical applications, the hypothesis test procedure is conducted as 
follows. 


1. The function for the test $T(x^n)$ is fixed.

2. The level probability is determined. Sometimes 0.05, 0.01, or 0.005 is
chosen.

3. For the fixed level probability, the real value $t$ is determined such that
$\mathrm{Level}(T,t)$ is equal to the level probability.

4. The reject region $\{x^n \; ; \; T(x^n) > t\}$ is determined.

5. If a sample $x^n$ is contained in the reject region, then the alternative
hypothesis is chosen, otherwise the null hypothesis is chosen.

If two hypothesis tests $(T(x^n), t)$ and $(U(x^n), u)$ satisfy the condition
that
$$\mathrm{Level}(T,t) = \mathrm{Level}(U,u) \;\Longrightarrow\; \mathrm{Power}(T,t) \ge \mathrm{Power}(U,u),$$
then $(T,t)$ is said to be more powerful than $(U,u)$. This definition gives a
partial order on the set of all hypothesis tests. In general, it is not a total
order. If there exists a test which is more powerful than any other test,
then it is called the most powerful test. In a Bayesian hypothesis test, it is
explicitly given by the partition function.


Theorem 28. Assume that null and alternative hypotheses are given by
$$\text{N.H.}: \; w_0 \sim \varphi_0(w_0), \quad X_i \sim p_0(x|w_0),$$
$$\text{A.H.}: \; w_1 \sim \varphi_1(w_1), \quad X_i \sim p_1(x|w_1).$$
Then the hypothesis test $(L(x^n), \ell)$ defined by
$$L(x^n) = \frac{\int \varphi_1(w_1)\prod_{i=1}^n p_1(x_i|w_1)\,dw_1}{\int \varphi_0(w_0)\prod_{i=1}^n p_0(x_i|w_0)\,dw_0} \tag{9.10}$$
is the most powerful test.

Proof. Let $(T(x^n), t)$ be an arbitrary hypothesis test. Assume that a real
value $\ell$ is set such that both levels are equal to each other,
$$\mathrm{Level}(L,\ell) = \mathrm{Level}(T,t). \tag{9.11}$$
To prove the theorem, it is sufficient to show that
$$\Delta P = \mathrm{Power}(L,\ell) - \mathrm{Power}(T,t)$$
is not smaller than zero. Two events are defined by
$$A = \{x^n \; ; \; T(x^n) - t > 0\},$$
$$B = \{x^n \; ; \; L(x^n) - \ell > 0\}.$$
Here, by the definition of $L(x^n)$, we can assume $\ell > 0$. Then by the definition
in eq.(9.8), eq.(9.11) is equivalent to
$$\int_B P(x^n|\varphi_0, p_0)\,dx^n - \int_A P(x^n|\varphi_0, p_0)\,dx^n = 0. \tag{9.12}$$
On the other hand, by eq.(9.9),
$$\Delta P = \mathrm{Prob}(L(X^n) > \ell \,|\, \text{A.H.}) - \mathrm{Prob}(T(X^n) > t \,|\, \text{A.H.})$$
$$= \int_B P(x^n|\varphi_1, p_1)\,dx^n - \int_A P(x^n|\varphi_1, p_1)\,dx^n$$
$$= \int_{B\cap A^c} P(x^n|\varphi_1, p_1)\,dx^n - \int_{A\cap B^c} P(x^n|\varphi_1, p_1)\,dx^n,$$
where $A^c$ and $B^c$ are the complementary sets of $A$ and $B$, respectively. Note
that $B\cap A^c \subset B$ and $A\cap B^c \subset B^c$. By eq.(9.10), the condition $x^n \in B$ is
equivalent to
$$P(x^n|\varphi_1, p_1) > \ell\,P(x^n|\varphi_0, p_0).$$
Therefore
$$\Delta P \ge \ell\int_{B\cap A^c} P(x^n|\varphi_0, p_0)\,dx^n - \ell\int_{A\cap B^c} P(x^n|\varphi_0, p_0)\,dx^n$$
$$= \ell\int_B P(x^n|\varphi_0, p_0)\,dx^n - \ell\int_A P(x^n|\varphi_0, p_0)\,dx^n = 0,$$
where the last equation is derived by eq.(9.12). □


Remark 72. For the most powerful test $L(x^n)$ and a given level $\varepsilon > 0$, the
reject region $L(x^n) > t$ is determined by choosing $t$ such that
$$\mathrm{Level}(L,t) = \mathrm{Prob}(L(X^n) > t \,|\, \text{N.H.}) = \varepsilon.$$
Therefore, in order to make a hypothesis test, we need the probability density
function of the random variable $L(X^n)$ when the null hypothesis holds.
Example 63. Let us study a case in which a common statistical model is
used and two priors are compared,
$$p_0(x|a) = p_1(x|a) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}(x-a)^2\Big), \tag{9.13}$$
$$\varphi_0(a) = \delta(a), \tag{9.14}$$
$$\varphi_1(a) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}a^2\Big). \tag{9.15}$$
Then
$$L(x^n) = \frac{\int \varphi_1(a)\prod_{i=1}^n p_1(x_i|a)\,da}{\prod_{i=1}^n p_0(x_i|0)} \tag{9.16}$$
$$= \frac{1}{\sqrt{n+1}}\exp\Big(\frac{(\sum_{i=1}^n x_i)^2}{2(n+1)}\Big). \tag{9.17}$$
Let us make a hypothesis test for the level 0.01. The real value $t$ is determined
such that
$$\mathrm{Prob}(L(X^n) > t \,|\, \text{N.H.}) = 0.01,$$
where the null hypothesis is
$$X^n \sim \prod_{i=1}^n \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}x_i^2\Big).$$
By eq.(9.17), the condition $L(x^n) > t$ is equivalent to $|(\sum_{i=1}^n x_i)/\sqrt{n}| > t^*$,
where
$$t^* = \sqrt{(1 + 1/n)\{2\log t + \log(n+1)\}}.$$
Under the null hypothesis, the random variable $(\sum_{i=1}^n X_i)/\sqrt{n}$ is subject to
the standard normal distribution $N(x)$. Since
$$\int_{|x| > 2.58} N(x)\,dx = 0.01,$$
the reject region for the level 0.01 is
$$\Big\{x^n \; ; \; \Big|\sum_{i=1}^n x_i\Big|/\sqrt{n} > 2.58\Big\}.$$
In other words, if $|\sum_{i=1}^n x_i|/\sqrt{n} > 2.58$, then the alternative hypothesis is
chosen; otherwise, the null hypothesis is chosen.
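The test procedure of this example can be reproduced by simulation. The following sketch assumes the normalized prior $N(0,1)$ under the alternative, for which the likelihood ratio is $L(x^n) = (n+1)^{-1/2}\exp((\sum_i x_i)^2/(2(n+1)))$; the function name `log_L`, the sample size, and the number of trials are hypothetical choices. It determines the threshold for level 0.01 as an empirical quantile under the null hypothesis, converts it to a threshold on $|\sum_i x_i|/\sqrt{n}$, and then estimates the power under the alternative.

```python
import numpy as np


def log_L(x):
    """log of the likelihood ratio for the N(0,1) prior versus the point mass at a = 0."""
    n = x.shape[-1]
    s = x.sum(axis=-1)
    return -0.5 * np.log(n + 1) + s ** 2 / (2.0 * (n + 1))


rng = np.random.default_rng(0)
n, trials = 20, 100_000

null_samples = rng.standard_normal((trials, n))          # data under N.H. (a = 0)
t_log = np.quantile(log_L(null_samples), 0.99)           # log t giving level 0.01
# Equivalent threshold on |sum x_i| / sqrt(n)
t_star = np.sqrt((1 + 1 / n) * (2 * t_log + np.log(n + 1)))
print(t_star)                                            # compare with the normal quantile 2.58

a = rng.standard_normal(trials)                          # a ~ phi_1 under A.H.
alt_samples = a[:, None] + rng.standard_normal((trials, n))
power = np.mean(log_L(alt_samples) > t_log)
print(power)
```

Because $L(x^n)$ is a monotone function of $|\sum_i x_i|$, the empirical threshold depends on the prior only through this statistic, which is why the reject region reduces to a two-sided normal test.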
9.3 Bayesian Model Comparison


In this section we study a Bayesian model comparison using the posterior 
distribution. This decision rule is different from the hypothesis test. 

Let the prior probabilities ag and a; satisfy 0 < ag,a, < 1, a9 +a, = 1. 
We assume that X” are generated by the following process. Firstly the 
random variable Y is determined by 


Y=0 (with probability ao), (9.18) 
Y=1 (with probability a1). (9.19) 


Then X” is generated by 


nm 

IfY=0 => wr gow), X”~ ][po(zilw), (9.20) 
J=1 
nm 

Y= = we @i(w), A" ~ [[ i @ilv). (9.21) 
1=1 


In this case the simultaneous probability density function of (Y, X") is given 
by 


plys2") = (aoP(2"|go,p0))(arP(a"|grp1))', (9.22) 


where the definition of P(x” |y, p) is given in eq.(9.3). The posterior proba- 
bility of Y = 1 for a given X” is 


ny p(1, 2”) 
pale) = Oe") wey _ 


which gives the posterior model decision. Let us study the decision rule by 
which the random variable Y is estimated by the following random variable 
Z, 
$$p(1|x^n) > a \;\Longrightarrow\; Z = 1, \tag{9.24}$$
$$p(1|x^n) \le a \;\Longrightarrow\; Z = 0, \tag{9.25}$$


where $0 < a < 1$ is a constant. Since $p(0|x^n) = 1 - p(1|x^n)$, the condition
$p(1|x^n) > a$ is equivalent to

$$\frac{p(1|x^n)}{p(0|x^n)} > \frac{a}{1-a}.$$

By eq.(9.22), it is equivalent to

$$\frac{a_1 P(x^n|\varphi_1, p_1)}{a_0 P(x^n|\varphi_0, p_0)} > \frac{a}{1-a}.$$

By eq.(9.3), it is also equivalent to

$$\frac{\int \varphi_1(w_1) \prod_{i=1}^n p_1(x_i|w_1)\,dw_1}{\int \varphi_0(w_0) \prod_{i=1}^n p_0(x_i|w_0)\,dw_0} > \frac{a_0\, a}{a_1 (1-a)}. \tag{9.26}$$

This inequality coincides with the most powerful test if $a_0 a/(a_1(1-a)) = t$.
Note that, if $a_0 = a_1 = a = 1/2$, then the right hand side of this inequality
is equal to 1.
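
The posterior model decision can be sketched numerically as follows. This is a minimal illustration, not part of the original text: the function names `posterior_prob_model1` and `decide` are hypothetical, and the two log marginal likelihoods are assumed to have been computed elsewhere. Working with logarithms avoids underflow when the marginal likelihoods are very small.

```python
import math

def posterior_prob_model1(log_ml0, log_ml1, a0=0.5, a1=0.5):
    """Return p(1 | x^n) from the two log marginal likelihoods, as in eq.(9.23)."""
    # log of a_y * P(x^n | phi_y, p_y) for y = 0, 1
    s0 = math.log(a0) + log_ml0
    s1 = math.log(a1) + log_ml1
    # Subtract the maximum before exponentiating (log-sum-exp trick).
    m = max(s0, s1)
    e0 = math.exp(s0 - m)
    e1 = math.exp(s1 - m)
    return e1 / (e0 + e1)

def decide(log_ml0, log_ml1, a0=0.5, a1=0.5, a=0.5):
    """Decision rule of eqs.(9.24)-(9.25): Z = 1 if p(1 | x^n) > a, else Z = 0."""
    return 1 if posterior_prob_model1(log_ml0, log_ml1, a0, a1) > a else 0

# Illustrative values only: model 1 has a log marginal likelihood larger by 2.
p1 = posterior_prob_model1(-1000.0, -998.0)
print(p1, decide(-1000.0, -998.0), decide(-1000.0, -998.0, a=0.9))
```

With equal prior probabilities the posterior probability depends only on the difference of the two log marginal likelihoods, that is, on the difference of the free energies.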


Example 64. Let us compare the most powerful test with the posterior model 
comparison. Let us adopt the same case as Example 63, 


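$$p_0(x|a) = p_1(x|a) = \frac{1}{\sqrt{2\pi}} \exp\Bigl(-\frac{1}{2}(x-a)^2\Bigr), \tag{9.27}$$
$$\varphi_0(a) = \delta(a), \tag{9.28}$$
$$\varphi_1(a) = \frac{1}{\sqrt{2\pi}} \exp\Bigl(-\frac{1}{2}a^2\Bigr). \tag{9.29}$$


Then by eq.(9.17), the condition of eq.(9.26) is written as

$$\sqrt{\frac{2\pi}{n+1}}\, \exp\Bigl(\frac{1}{2(n+1)}\Bigl(\sum_{i=1}^n x_i\Bigr)^2\Bigr) > c,$$

where $c$ denotes the right hand side of eq.(9.26). Hence

$$\frac{|\sum_{i=1}^n x_i|}{\sqrt{n}} > \Bigl[\Bigl(1 + \frac{1}{n}\Bigr)\Bigl\{2 \log c - \log\frac{2\pi}{n+1}\Bigr\}\Bigr]^{1/2} \approx \sqrt{\log n}. \tag{9.30}$$

If $n$ is sufficiently large, then the right hand side is approximated by
$\sqrt{\log n}$, whereas the threshold of the most powerful test for the level
0.01 is 2.58. That is to say, in the hypothesis test

$$\frac{|\sum_{i=1}^n x_i|}{\sqrt{n}} < 2.58 \;\Longrightarrow\; \text{Model 0 is chosen},$$

whereas in the posterior model comparison,

$$\frac{|\sum_{i=1}^n x_i|}{\sqrt{n}} < \sqrt{\log n} \;\Longrightarrow\; \text{Model 0 is chosen}.$$
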
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There is no mathematical contradiction because the hypothesis test and 
the posterior model comparison are different procedures based on different 
assumptions. However, this difference is sometimes referred to as a paradox. 
The former has the same decision order as the model selection by the cross 
validation or information criteria, whereas the latter does that by the free 
energy. 
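
The difference between the two procedures can be observed by a small simulation. The following is a sketch under stated assumptions: the function name `null_rejection_rate` is hypothetical, only the leading term $\sqrt{\log n}$ of the posterior threshold is used, and the statistic $(\sum X_i)/\sqrt{n}$ is drawn directly as a standard normal variable because that is its exact distribution under the null hypothesis.

```python
import math
import random

def null_rejection_rate(threshold, trials=20000, seed=0):
    """Estimate Prob(|sum X_i| / sqrt(n) > threshold) under the null hypothesis.

    Under the null hypothesis the statistic is exactly standard normal for
    every n, so it is sampled directly instead of summing n samples.
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if abs(rng.gauss(0.0, 1.0)) > threshold)
    return hits / trials

test_rate = null_rejection_rate(2.58)
for n in (10, 1000, 100000):
    posterior_threshold = math.sqrt(math.log(n))
    posterior_rate = null_rejection_rate(posterior_threshold)
    print(f"n={n:6d}  test: 2.58 (rate {test_rate:.4f})  "
          f"posterior: {posterior_threshold:.2f} (rate {posterior_rate:.4f})")
```

The hypothesis test keeps the probability of choosing Model 1 under the null hypothesis fixed at the level, whereas the threshold of the posterior model comparison grows with $n$, so that this probability decreases as $n$ increases.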


9.4 Phase Transition 


Let us define a phase transition in Bayesian statistics. 


Definition 29. (Phase transition) If a statistical model $p(x|w)$ or a prior
$\varphi(w)$ is determined by a value $\theta$ which is not the parameter, then it is
written as $p(x|w,\theta)$ or $\varphi(w|\theta)$. Therefore, the posterior distribution is
also written as $p(w|X^n,\theta)$. Such a value $\theta$ is called a generalized
hyperparameter. If a posterior distribution for a sufficiently large $n$ changes
drastically at $\theta = \theta_c$, then it is said that the posterior distribution has a
phase transition, and $\theta_c$ is called a critical point. At a critical point, the
free energy $F_n(\beta, \theta)$ is often discontinuous or nondifferentiable.


If a log density ratio function $f(x, w) = \log(q(x)/p(x|w))$ has relatively
finite variance, then

$$F_n - nS_n = -\log \int \exp\Bigl(-\sum_{i=1}^n f(X_i, w)\Bigr) \varphi(w)\,dw$$

and

$$\overline{F}_n = -\log \int \exp\bigl(-n\,E_X[f(X, w)]\bigr) \varphi(w)\,dw$$

have the same asymptotic expansions in the terms whose order is larger
than $O(1)$. Hence we can analyze the phase transition by studying
$\overline{F}_n$ instead of $F_n$.


Example 65. Firstly, we study a case when the log density ratio function 
has a generalized hyperparameter. Let $w = (x, y)$ and

$$\overline{F}_n(\theta) = -\log \int_0^1 dx \int_0^1 dy\, \exp(-n x^2 y^\theta).$$

Let $K(x, y) = x^2 y^\theta$ $(\theta > 0)$. Then $K(x, y) \ge 0$ and the set of all zero
points of $K(x, y)$ is $\{(x, y); xy = 0\}$. It can be rewritten as

$$\overline{F}_n(\theta) = -\log \int_0^\infty \frac{dt}{n} \int_0^1 dx \int_0^1 dy\, \exp(-t)\, \delta(t/n - K(x, y)).$$

Since $K(x, y)$ is a normal crossing function whose multi-indexes are
$k = (2, \theta)$ and $h = (0, 0)$, the behavior of the free energy can be analyzed by
using the zeta function,

$$\zeta(z) = \int_0^1 dx \int_0^1 dy\, (x^2 y^\theta)^z = \frac{1}{(2z+1)(\theta z+1)}.$$

Hence the state density function is equal to

$$\delta(t - K(x, y)) \cong \begin{cases} \frac{1}{2\theta}\, t^{-1/2}\, \delta(x)\, y^{-\theta/2} & (\theta < 2) \\ \frac{1}{4}\, t^{-1/2} (-\log t)\, \delta(x)\delta(y) & (\theta = 2) \\ \frac{1}{2\theta}\, t^{1/\theta - 1}\, x^{-2/\theta}\, \delta(y) & (\theta > 2). \end{cases}$$

Therefore, the critical point is $\theta = 2$, where the posterior distribution
drastically changes from $\delta(x)$ to $\delta(y)$ as $\theta$ moves from $2 - \epsilon$ to
$2 + \epsilon$. The asymptotic behavior of the free energy is given by

$$\overline{F}_n(\theta) = \begin{cases} (1/2) \log n + O(1) & (\theta < 2) \\ (1/2) \log n - \log\log n + O(1) & (\theta = 2) \\ (1/\theta) \log n + O(1) & (\theta > 2), \end{cases}$$

which shows that the coefficient of $\log n$ is a continuous function of $\theta$ but
not differentiable at $\theta = 2$.
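
The coefficient of $\log n$ in this example can be checked numerically. The following sketch is an illustration under stated assumptions, not part of the original text: the function names are hypothetical, the integral over $x$ is evaluated in closed form with the error function, the integral over $y$ uses the substitution $y = e^{-s}$ with a midpoint rule, and the coefficient is estimated by a finite difference between two large sample sizes.

```python
import math

def averaged_free_energy(n, theta, s_max=60.0, steps=60000):
    """-log of the integral of exp(-n x^2 y^theta) over the unit square."""
    def inner(u):
        # Closed form: int_0^1 exp(-u^2 x^2) dx = sqrt(pi) * erf(u) / (2u).
        if u < 1e-8:
            return 1.0
        return math.sqrt(math.pi) * math.erf(u) / (2.0 * u)

    # Substitute y = exp(-s), dy = y ds, to resolve the region near y = 0.
    ds = s_max / steps
    total = 0.0
    for j in range(steps):
        s = (j + 0.5) * ds
        y = math.exp(-s)
        u = math.sqrt(n) * y ** (theta / 2.0)
        total += inner(u) * y * ds
    return -math.log(total)

def log_n_coefficient(theta, n1=1e6, n2=1e12):
    """Finite-difference estimate of the coefficient of log n."""
    diff = averaged_free_energy(n2, theta) - averaged_free_energy(n1, theta)
    return diff / (math.log(n2) - math.log(n1))
```

For example, `log_n_coefficient(1.0)` is close to $1/2$ and `log_n_coefficient(4.0)` is close to $1/4$. At $\theta = 2$ the same finite difference falls visibly below $1/2$ for these sample sizes, which reflects the $-\log\log n$ term.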


In Bayesian statistics, the following concepts are all mathematically con- 
nected. In order to analyze the phase transition, we can choose the most 
convenient one. 

(1) Partition function 

(2) Free energy or the minus log marginal likelihood 
(4) State density function 

(5) Zeta function 

(6) Posterior distribution 


Example 66. Secondly, let us study a case when the prior has a hyperparameter.
Assume that $w = (x, y)$, $x \in \mathbb{R}^1$, and $y \in \mathbb{R}^M$. Let us study a free
energy,

$$\overline{F}_n(\theta) = -\log \int_0^1 dx \int_{\|y\|<1} dy\, \exp(-n x^2 \|y\|^2)\, x^{\theta-1}.$$

The zeta function is equal to

$$\zeta(z) = \int_0^1 dx \int_{\|y\|<1} dy\, x^{2z+\theta-1} \|y\|^{2z}.$$

By using the generalized polar system $(r, \omega)$ such that $y = r\omega$, we have
$dy = r^{M-1} dr\, d\omega$, resulting that

$$\zeta(z) = \int_0^1 dx \int_0^1 dr \int d\omega\; x^{2z+\theta-1}\, r^{2z+M-1} = \frac{1}{(2z+\theta)(2z+M)} \Bigl(\int d\omega\Bigr).$$

Hence it follows that, up to constant factors, the state density function is

$$\delta(t - x^2\|y\|^2)\, x^{\theta-1} \cong \begin{cases} t^{\theta/2-1}\, \delta(x)\, r^{M-\theta-1} & (\theta < M) \\ t^{M/2-1}\, x^{\theta-M-1}\, \delta(y) & (\theta > M). \end{cases}$$

Therefore the posterior distribution drastically changes from $\delta(x)$ to $\delta(y)$ at
the critical point $\theta = M$. Then the free energy is

$$\overline{F}_n(\theta) = \begin{cases} (\theta/2) \log n + O(1) & (\theta < M) \\ (M/2) \log n + O(1) & (\theta > M), \end{cases}$$

which shows that the coefficient of $\log n$ of the free energy is continuous at
$\theta = M$ but not differentiable.


Example 67. Let us study a normal mixture of $x \in \mathbb{R}^2$ for a given parameter
$a$ $(0 < a < 1)$ and $b \in \mathbb{R}^2$,


$$p(x|a, b) = aN(x|0) + (1 - a)N(x|b).$$


For a prior, we use the Dirichlet distribution with index $\alpha > 0$,


where $\alpha$ and $\sigma$ are hyperparameters. We set $\sigma = 10$, and study the phase
transition according to the hyperparameter $\alpha$. Assume that the true
distribution is


Then the zeta function is 


where 
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Then in the neighborhood of (a,b) = 0, there exists c,c2 > 0 such that 
$$c_1 a^2 \|b\|^2 \le K(a, b) \le c_2 a^2 \|b\|^2.$$
Hence the real log canonical threshold is
$$\lambda = \min\{\alpha/2, 1\},$$

which shows that there exists a phase transition at the critical point $\alpha = 2$.
Thus the average generalization and cross validation errors are given by

$$E[G_n - S] = \min\{\alpha/2, 1\}/n + o(1/n),$$
$$E[C_n - S_n] = \min\{\alpha/2, 1\}/n + o(1/n).$$

Moreover, if $\alpha < 2$, then $a$ is in the neighborhood of the origin but $b$ is free,
whereas, if $\alpha > 2$, then $b$ is in the neighborhood of the origin but $a$ is free.
Therefore, the posterior distribution drastically changes at the critical point.
Moreover, at the critical point $\alpha = 2$, the posterior distribution is unstable,
hence MCMC processes may have large variance. Let us observe the phase
transition by an experiment. The number of independent random variables
was set as $n = 100$. The hyperparameter $\alpha$ of the prior distribution was
controlled in $0 < \alpha < 6$. Figures 9.1 and 9.2 show the distributions of the
generalization errors and cross validation errors for a given hyperparameter,
respectively. The circles in both figures show their averages. Note that, by
the equation

$$(G_n - S) + (C_n - S_n) = \frac{2\lambda}{n} + o_p(1/n),$$

if $G_n - S$ is larger than the average, then $C_n - S_n$ is smaller than the average.
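
The leading term of the average errors in this example can be tabulated directly from the real log canonical threshold. The short sketch below is only an illustration: the function names `rlct` and `predicted_mean_error` are hypothetical, and the remainder $o(1/n)$ is ignored.

```python
def rlct(alpha):
    """Real log canonical threshold of this example: min{alpha / 2, 1}."""
    return min(alpha / 2.0, 1.0)

def predicted_mean_error(alpha, n):
    """Leading term lambda / n of the average generalization and cross validation errors."""
    return rlct(alpha) / n

for alpha in (0.5, 1.0, 2.0, 4.0, 6.0):
    print(alpha, rlct(alpha), predicted_mean_error(alpha, n=100))
```

The predicted average grows linearly in the hyperparameter up to the critical point 2 and is constant above it, which is the kink to look for in the averages of the two figures.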


Remark 73. (1) In this section, we study the cases where we can derive the 
poles of the zeta functions. In general, in order to find them, resolution of 
singularities is necessary, which may often be difficult. That is to say, it is 
not easy to find the critical point rigorously. 

(2) If we apply the mean field approximation, or equivalently the variational 
Bayes, to statistical estimation, then the posterior distribution made by 
the mean field approximation shrinks or becomes localized. As a result, 
the free energy becomes larger, and the phase transition structure changes. 
Sometimes a spurious phase transition can be observed in the mean field 
approximation which does not exist in the true posterior distribution. The 
critical points of the mean field approximation and the true posterior do not 
coincide in general. 
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Figure 9.1: Phase transition of generalization error in a normal mixture. 
The horizontal and vertical axes show the Dirichlet hyperparameter $\alpha$ and the
generalization error, respectively. $\alpha = 2$ is the critical point.
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Figure 9.2: Phase transition of cross validation error in a normal mixture. 
The horizontal and vertical axes show the Dirichlet hyperparameter $\alpha$ and the
cross validation error, respectively. $\alpha = 2$ is the critical point.
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9.5 Discovery Process 


Hierarchical statistical models such as neural networks and normal mixtures 
have several phases which are determined by the complexity of models K 
and the number of independent random variables n. In general, a true dis- 
tribution cannot be represented by any finite statistical model. However, if 
n is small, the true distribution seems to be almost realizable by a statistical 
model. If n is large, it seems to be unrealizable. Figure 9.3 shows the phase 
diagram of such statistical models. The horizontal and vertical lines show n 
and K respectively. The bold line shows the critical line. The upper side of 
the critical line is the phase in which a true distribution is realizable by and 
singular for a statistical model, and the lower side indicates the reverse. If 
n is fixed and K is controlled, then it is a model selection process. If K is 
fixed and n is controlled, then it is a discovery process. In discovery process, 
if n is small, a true distribution seems to be singular for and realizable by a 
statistical model, otherwise, it seems to be regular and unrealizable. In this 
section we study a discovery process. 

Let us consider a normal mixture of $x \in \mathbb{R}^2$ for a given parameter $a$
$(0 < a < 1)$ and $b, c \in \mathbb{R}^2$,


$$p(x|a, b, c) = aN(x|b) + (1 - a)N(x|c).$$
For a prior, we use the Dirichlet distribution with index $\alpha > 0$,


$$\varphi(a) \propto (a(1-a))^{\alpha-1},$$
$$\varphi(b, c) \propto \exp\Bigl(-\frac{\|b\|^2 + \|c\|^2}{2\sigma^2}\Bigr),$$

where $\alpha = 0.3$ and $\sigma = 10$ are hyperparameters. The number of random
variables is set as $n = 5, 10, 20, \ldots, 1280$. Three experiments were conducted.
(1) A case when $q(x) = p(x|a, b, c)$ where $a = 0.5$, $b = (2, 2)$, $c = (-2, -2)$
was investigated. In this case, a true distribution is realizable by and regular
for a statistical model for every $n$, and 200 independent experiments were
used for computing the averages and standard deviations. In Figure 9.4,
$n(G_n - S)$, $n(\mathrm{ISCV} - S_n)$, $n(\mathrm{WAIC} - S_n)$, $n(\mathrm{AIC} - S_n)$, and $n(\mathrm{DIC} - S_n)$
are displayed. In this case the $n$ times averages converged to $d/2 = 2.5$
where $d$ is the dimension of the parameter. If $n > 20$, every criterion could
estimate the generalization error. In this case, there was no phase transition.
(2) A case when $q(x) = p(x|a, b, c)$ where $a = 0.5$, $b = (0, 0)$, $c = (0, 0)$ was
investigated. In this case, a true distribution was realizable by but nonregular
for a statistical model for every $n$. In this case the $n$ times average of
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Figure 9.3: Phase diagram. Statistical models such as neural networks and 
normal mixtures have phase transitions according to the number of compo- 
nents and the sample size. If n is small, then a true distribution seems to 
be realizable and singular. If n is large, then it seems to be unrealizable and 
regular. 


the generalization error converged to $\lambda = 1.5$ where $\lambda$ is the real log
canonical threshold. In Figure 9.5, $n(G_n - S)$, $n(\mathrm{ISCV} - S_n)$, $n(\mathrm{WAIC} - S_n)$,
$n(\mathrm{AIC} - S_n)$, and $n(\mathrm{DIC} - S_n)$ are displayed, which shows that both ISCV
and WAIC could estimate the generalization losses, whereas neither AIC
nor DIC could. In this case, there was no phase transition.

(3) A case when $q(x) = p(x|a, b, c)$ where $a = 0.5$, $b = (0.5, 0.5)$,
$c = (-0.5, -0.5)$ was investigated. In Figure 9.6, $n(G_n - S)$, $n(\mathrm{ISCV} - S_n)$,
$n(\mathrm{WAIC} - S_n)$, $n(\mathrm{AIC} - S_n)$, and $n(\mathrm{DIC} - S_n)$ are displayed. In the region
$n < 20$, the generalization loss was almost equal to the case when the true
distribution is singular for a statistical model, which is the case (2). In the
region $n > 320$, it was almost equal to the case when the true distribution
is regular for a statistical model, which is the case (1). And in the region
$40 < n < 160$, the generalization errors were larger than in the other cases,
since the posterior distribution was on the critical point. Both ISCV and WAIC
could estimate the generalization losses, whereas neither AIC nor DIC could.
In this case, there was a phase transition in the course of the discovery process.


By using the same model as above (3), let us study the statistical esti- 
mation problem of the hidden variable. A hidden variable y is introduced 
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Figure 9.4: Strict regular case. The horizontal and vertical lines show the 
sample size and errors respectively. A true distribution is set as realizable 
by and regular for a statistical model. No phase transition occurs. 




Figure 9.5: Strict singular case. The horizontal and vertical lines show the 
sample size and errors respectively. A true distribution is set as realizable 
by and singular for a statistical model. No phase transition occurs. 
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Figure 9.6: Discovery process. The horizontal and vertical lines show the 
sample size and errors respectively. As n increases the true structure is 
discovered by a statistical model. A phase transition occurs. 


by the following equation,
$$p(x, y|a, b, c) = [aN(x|b)]^{1-y} \cdot [(1 - a)N(x|c)]^{y}.$$
That is to say,


$$p(x, 0|a, b, c) = aN(x|b),$$
$$p(x, 1|a, b, c) = (1 - a)N(x|c).$$


By marginalizing $y$ this model results in a normal mixture,
$$p(x|a, b, c) = aN(x|b) + (1 - a)N(x|c).$$


The likelihood function of $(x^n, y^n)$ is given by

$$p(x^n, y^n|a, b, c) = \prod_{i=1}^n [aN(x_i|b)]^{1-y_i} \cdot [(1 - a)N(x_i|c)]^{y_i}.$$

Figures 9.7, 9.8, and 9.9 show the true distribution $q(x)$ at left, and estimated
hidden variables for $n = 10, 100, 1000$ at right. If the hidden variable of a
sample point $x_i$ was estimated as $y_i < 0.5$, then $x_i$ is shown by a point;
otherwise, it is shown by a white square. For $n = 10$, it seems that a true
distribution consists of one normal distribution, whereas for $n = 1000$, two
distributions. For the case $n = 100$, it is on a critical point, hence the
estimated results were unstable.
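
For fixed parameters, the probability that the hidden variable of a sample point equals 1 follows from the joint model by Bayes' rule. The sketch below is a simplified illustration under stated assumptions: the function names are hypothetical, $N(x|b)$ is taken to be the two-dimensional normal density with mean $b$ and identity covariance, and the parameters $(a, b, c)$ are held fixed, whereas the experiment in the text averages over the posterior distribution by MCMC.

```python
import math

def normal_density_2d(x, mean):
    """Two-dimensional normal density with identity covariance (an assumption)."""
    sq = (x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2
    return math.exp(-0.5 * sq) / (2.0 * math.pi)

def prob_hidden_is_one(x, a, b, c):
    """P(y = 1 | x, a, b, c) = (1 - a) N(x|c) / {a N(x|b) + (1 - a) N(x|c)}."""
    w0 = a * normal_density_2d(x, b)          # joint density p(x, 0 | a, b, c)
    w1 = (1.0 - a) * normal_density_2d(x, c)  # joint density p(x, 1 | a, b, c)
    return w1 / (w0 + w1)

a, b, c = 0.5, (0.5, 0.5), (-0.5, -0.5)
for x in [(0.0, 0.0), (-1.0, -1.0), (1.0, 1.0)]:
    print(x, prob_hidden_is_one(x, a, b, c))
```

A point is then drawn as a point or a white square according to whether this probability, averaged over the posterior, is below 0.5 or not.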
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Figure 9.7: Estimated hidden variables n = 10. A true distribution and 
estimated latent variables in n = 10. Although the true distribution consists 
of two components, only one component is found since n is too small. 


Remark 74. (1) In the above experiment, the hidden variable y = (0,1) can 
be replaced by y = (1,0), because of symmetry. However, MCMC naturally 
made this symmetry break down, which was used for distinguishing 0 and 
1. 

(2) The phase transition affects the estimation of hidden variables. The 
phase transition also affects the MCMC process. Hierarchical statistical 
models such as neural networks and normal mixtures have the same phase 
transition as this case, hence a statistician must know its structure before 
applying statistical models to the real world problems. 


9.6 Hierarchical Bayes 


In this section, hierarchical Bayesian estimation is studied. A typical case 
is explained by using an example. 


Example 68. (Hierarchical Bayesian inference) In a high school, there are 
m = 10 classes, and each class has n = 30 students. One day, an examination 
of mathematics was done, and $mn = 300$ scores $\{x_{ki}; 1 \le k \le m, 1 \le i \le n\}$
were obtained. We would like to analyze the following statistical model.
(1) The average $w_k$ of the $k$th class's scores is subject to $N(\mu, 1^2)$.

(2) The score $x_{ki}$ of the $i$th student in the $k$th class is subject to $N(w_k, 10^2)$.
That is to say, the model is made by


$$w_k \sim N(\mu, 1^2),$$
$$x_{ki} \sim N(w_k, 10^2).$$
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Figure 9.8: Estimated hidden variables n = 100. A true distribution and 
estimated latent variables in n = 100. The posterior distribution is unsta- 
ble, because it lies on the critical point between one component and two 
components. 


Figure 9.9: Estimated hidden variables n = 1000. A true distribution and 
estimated latent variables in n = 1000. Two components were found. 
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We would like to evaluate this model according to the obtained data. 


Statistical Model 

(1) Let $\mu$ be the hyperparameter.

(2) $\{w_k\}_{k=1}^m$ are independently taken from the prior distribution $\varphi(w|\mu)$.
(3) $(x_k)^n = \{x_{ki}\}_{i=1}^n$ are independently taken from $p(x|w_k)$.


From the predictive point, there are at least two different types of prediction. 
(1) (New student prediction) If a new student is added to the $k$th class, we
predict a new score.

(2) (New class prediction) If a new class which has 30 students is added to 
the high school, we predict new scores. 

The evaluation of the statistical model depends on the type of prediction. 


New Student Prediction. Given all data $\{(x_k)^n\}$, the posterior
distribution of all classes $(w_1, w_2, \ldots, w_m)$ is

$$p(w_1, w_2, \ldots, w_m | (x_1)^n, (x_2)^n, \ldots, (x_m)^n) \propto \prod_{k=1}^m \Bigl(\varphi(w_k|\mu) \prod_{i=1}^n p(x_{ki}|w_k)\Bigr).$$

This distribution shows that $(w_1, w_2, \ldots, w_m)$ are independent of each other,
hence

$$p(w_k|(x_k)^n) \propto \varphi(w_k|\mu) \prod_{i=1}^n p(x_{ki}|w_k).$$

Let $E_{w_k}[\ ]$ and $V_{w_k}[\ ]$ be the average and variance operators of this
distribution. The predictive distribution of $y$ for the $k$th class is equal to
$E_{w_k}[p(y|w_k)]$, resulting that ISCV and WAIC for the $k$th class are

$$\mathrm{ISCV}_k = \frac{1}{n} \sum_{i=1}^n \log E_{w_k}[1/p(x_{ki}|w_k)], \tag{9.31}$$
$$\mathrm{WAIC}_k = -\frac{1}{n} \sum_{i=1}^n \log E_{w_k}[p(x_{ki}|w_k)] + \frac{1}{n} \sum_{i=1}^n V_{w_k}[\log p(x_{ki}|w_k)]. \tag{9.32}$$

If a student is added to every class, then

$$\sum_k \mathrm{ISCV}_k, \qquad \sum_k \mathrm{WAIC}_k$$
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are equal to ISCV and WAIC, respectively. Therefore, the model evaluation 
can be done by using this value. 
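
For the model of Example 68 the posterior distribution of each class average is normal, so eqs.(9.31) and (9.32) can be estimated by simple Monte Carlo. The following is a minimal sketch under stated assumptions, not the procedure used in the text: the function name `class_iscv_waic` is hypothetical, the hyperparameter value $\mu = 60$ and the simulated scores are illustrative only, and the posterior is sampled exactly by using conjugacy instead of MCMC.

```python
import math
import random

def class_iscv_waic(scores, mu, prior_sd=1.0, noise_sd=10.0, draws=4000, seed=0):
    """Monte Carlo estimates of ISCV_k and WAIC_k for one class."""
    rng = random.Random(seed)
    n = len(scores)
    # Conjugate posterior of w_k: normal with precision-weighted mean.
    precision = 1.0 / prior_sd**2 + n / noise_sd**2
    post_mean = (mu / prior_sd**2 + sum(scores) / noise_sd**2) / precision
    post_sd = math.sqrt(1.0 / precision)
    ws = [rng.gauss(post_mean, post_sd) for _ in range(draws)]

    def log_p(x, w):
        return -0.5 * ((x - w) / noise_sd) ** 2 - math.log(noise_sd * math.sqrt(2.0 * math.pi))

    iscv = waic = 0.0
    for x in scores:
        logs = [log_p(x, w) for w in ws]
        mean_log = sum(logs) / draws
        var_log = sum((v - mean_log) ** 2 for v in logs) / draws              # V_w[log p]
        log_mean_p = math.log(sum(math.exp(v) for v in logs) / draws)         # log E_w[p]
        log_mean_inv_p = math.log(sum(math.exp(-v) for v in logs) / draws)    # log E_w[1/p]
        iscv += log_mean_inv_p / n
        waic += (-log_mean_p + var_log) / n
    return iscv, waic

# Illustrative data for one class of 30 students (values are assumptions).
data_rng = random.Random(1)
w_true = data_rng.gauss(60.0, 1.0)
scores = [data_rng.gauss(w_true, 10.0) for _ in range(30)]
iscv, waic = class_iscv_waic(scores, mu=60.0)
print(iscv, waic)
```

In this setting the two values are expected to be very close to each other, and summing them over the classes gives the quantities used for the new student prediction.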


New Class Prediction. In the second problem, one sample point is $(x_k)^n$
and we have a sample which consists of $(x_1)^n, (x_2)^n, \ldots, (x_m)^n$. The
statistical model is

$$P((x_k)^n|\mu) = \int \varphi(w|\mu) \prod_{i=1}^n p(x_{ki}|w)\,dw.$$

In this case, the hyperparameter is the parameter of this model. By setting a
prior distribution $\pi(\mu)$ of $\mu$, the posterior distribution is given by

$$p(\mu|(x_1)^n, (x_2)^n, \ldots, (x_m)^n) \propto \pi(\mu) \prod_{k=1}^m P((x_k)^n|\mu).$$

Let $E_\mu[\ ]$ and $V_\mu[\ ]$ be the average and variance operators by this distribution.
The predictive distribution of $y^n$ is equal to $E_\mu[P(y^n|\mu)]$, resulting that

$$\mathrm{ISCV} = \frac{1}{m} \sum_{k=1}^m \log E_\mu[1/P((x_k)^n|\mu)], \tag{9.33}$$
$$\mathrm{WAIC} = -\frac{1}{m} \sum_{k=1}^m \log E_\mu[P((x_k)^n|\mu)] + \frac{1}{m} \sum_{k=1}^m V_\mu[\log P((x_k)^n|\mu)]. \tag{9.34}$$

In the second problem, $\mu$ is estimated by the posterior distribution. In
general, it is rather difficult to numerically calculate the cross validation
and WAIC in the second case.


Remark 75. Cross validation and information criteria are defined for eval- 
uation of the predictive behavior of the statistical estimation. A complex 
statistical model such as hierarchical Bayes methods may yield several dif- 
ferent predictions. To make an evaluation method, a statistician should 
determine which prediction would be evaluated. 


Example 69. (Hierarchical Bayes in linear regression) Assume that $x, y \in \mathbb{R}$
and $a_k, b_k, s \in \mathbb{R}$. Let $K$ be the number of the groups. For each $k$, a
statistical model $p(y|x, a_k, b_k, s)$ and a prior $\varphi(a_k, b_k|m_a, m_b, t)\,\varphi(s|r, \epsilon)$ are
defined by

$$p(y|x, a_k, b_k, s) = \sqrt{\frac{s}{2\pi}} \exp\Bigl(-\frac{s}{2}(y - a_k x - b_k)^2\Bigr),$$
$$\varphi(a_k, b_k|m_a, m_b) = \frac{t}{2\pi} \exp\Bigl(-\frac{t}{2}\{(a_k - m_a)^2 + (b_k - m_b)^2\}\Bigr),$$
$$\varphi(s|r) = \frac{\epsilon^r}{\Gamma(r)}\, s^{r-1} \exp(-\epsilon s),$$

where $(m_a, m_b)$ and $r$ are hyperparameters. Note that $(m_a, m_b)$ is the
common average of the prior which is estimated by using the uniform and
improper prior, and $r$ is optimized by the cross validation and WAIC. Let
$t = 1/0.2^2$ and $\epsilon = 0.01$ be fixed. Assume that the true distribution is given
by the common conditional density $p(y|x, a_0, b_0, s_0)$ where $a_0 = 1$, $b_0 = 0$,
and $s_0 = 0.1$. Let $\{(x_{ki}, y_{ki})\}_{i=1}^{n_k}$ be a sample for the $k$th group whose sample
size is $n_k$. The posterior distribution for $(s, a_k, b_k, m_a, m_b)$ is in proportion
to

$$\varphi(s|r) \prod_{k=1}^K \varphi(a_k, b_k|m_a, m_b) \prod_{i=1}^{n_k} p(y_{ki}|x_{ki}, a_k, b_k, s),$$

which is also in proportion to

$$s^{r-1} \exp(-\epsilon s) \prod_{k=1}^K \exp\Bigl(-\frac{t}{2}\{(a_k - m_a)^2 + (b_k - m_b)^2\}\Bigr) \prod_{i=1}^{n_k} \sqrt{s}\, \exp\Bigl(-\frac{s}{2}(y_{ki} - a_k x_{ki} - b_k)^2\Bigr).$$

This posterior density can be approximated by a Gibbs sampler,

$$p(m_a|s, a_k, b_k) = N\Bigl(\frac{1}{K} \sum_{k=1}^K a_k,\ \frac{1}{tK}\Bigr),$$
$$p(m_b|s, a_k, b_k) = N\Bigl(\frac{1}{K} \sum_{k=1}^K b_k,\ \frac{1}{tK}\Bigr),$$
$$p(s|a_k, b_k) = G\Bigl(r + \frac{1}{2} \sum_{k=1}^K n_k,\ 1/B\Bigr),$$
$$p(a_k, b_k|m_a, m_b, s) = N(A^{-1}v,\ A^{-1}),$$
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Table 9.1: Hyperparameters, cross validation error, and WAIC error. Cross 
validation and WAIC errors are compared as a function of hyperparameter 


in hierarchical Bayes. In this case $r = 1$ is chosen which minimizes both
cross validation and WAIC errors.


(Table) 


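where $N(m, S)$ is the normal distribution whose average and covariance
are $m$ and $S$, respectively, $G(a, b)$ is the gamma distribution defined by
eq.(8.74), and

$$A = \begin{pmatrix} t + s \sum_{i=1}^{n_k} x_{ki}^2 & s \sum_{i=1}^{n_k} x_{ki} \\ s \sum_{i=1}^{n_k} x_{ki} & t + s\, n_k \end{pmatrix},$$
$$v = \begin{pmatrix} t\, m_a + s \sum_{i=1}^{n_k} x_{ki} y_{ki} \\ t\, m_b + s \sum_{i=1}^{n_k} y_{ki} \end{pmatrix},$$
$$B = \epsilon + \frac{1}{2} \sum_{k=1}^K \sum_{i=1}^{n_k} (y_{ki} - a_k x_{ki} - b_k)^2.$$

For $K = 6$, $n_k = 5 + k$, the cross validation and WAIC errors for the first
case are compared. See Table 9.1. In this case, $r = 1$ which minimizes both
errors is chosen as the best hyperparameter.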


9.7 Problems 


1. Assume that $w$ is taken from $\Phi(w)$, and then $(X^{n+1}, Y^{n+1})$ are
independently taken from $P(x, y|w)$. For a function $f(x|x^n, y^n)$ from $x$ to $y$,
which is determined by $(x^n, y^n)$, the square error is defined by

$$E(f) = \int dw\, \Phi(w) \int dx^{n+1} dy^{n+1}\, P(x^{n+1}, y^{n+1}|w)\, \|y_{n+1} - f(x_{n+1}|x^n, y^n)\|^2,$$

where $\|y\|$ is the norm of $y$. Prove that this square error is minimized if and
only if

$$f(x_{n+1}|x^n, y^n) = \frac{\int dw\, \Phi(w) \int dy_{n+1}\, y_{n+1}\, P(x^{n+1}, y^{n+1}|w)}{\int dw\, \Phi(w) \int dy_{n+1}\, P(x^{n+1}, y^{n+1}|w)},$$
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which is the regression function made from the predictive distribution. 


2. Let x,a € R®. A statistical model and a prior are defined by 


al 
Pla) = aa 


el) = Gag exr(-Filal) 


Devise a hypothesis test for the null hypothesis (p(a|a),6(a)) versus the 
alternative one (p(z|a),y(a)). Then compare it with the minimum cross 
validation loss estimation defined by eq.(8.75). 


3. Let «,b € R@ and y,a € R. For a statistical model and a prior, 


1 1 
p(y|x, a, b) mae HP (gill — a tanh(b-2)|!") 


(a,b) ox fale", 


clarify the phase transition structure according to the hyperparameter a > 
0. 


4. Let $a = (a_1, a_2, \ldots, a_K)$ satisfy $\sum a_k = 1$ and $a_k \ge 0$. Also let $b_k \in \mathbb{R}^N$. A
statistical model of $x \in \mathbb{R}^N$


K 

— b,) 

p(a|a, b) - VR = exp( (Eo) 
k=1 


is called a normal mixture. Let the prior be a constant on (a,b). Assume 
that a true distribution is given by 


Then explain the discovery process of this case. 


Chapter 10 


Basic Probability Theory 


In this chapter, we summarize the basic probability theory which is an im- 
portant component in this book. 

(1) Delta function is defined. 

(2) Kullback-Leibler distance is introduced and its mathematical property 
is proved. 

(3) The definitions of the probability space and the random variable are 
described. 

(4) An empirical process is a random variable on a function space. In 
Bayesian theory construction, we need its convergence in distribution. The 
basic definitions and essential theorems are introduced. 

(5) Even if a sequence of random variables converges in distribution, the
sequence of its expected values may not converge. To prove the convergence
of the expected values, we need an additional condition such as
asymptotic uniform integrability.


10.1 Delta Function 


The delta function $\delta(x)$ is defined to be a generalized function of $x \in \mathbb{R}$
which satisfies, for an arbitrary continuous function $f(x)$,

$$\int \delta(x) f(x)\,dx = f(0).$$

The delta function is not an ordinary function but a kind of a distribution,
which formally satisfies

$$\delta(x) = \begin{cases} \infty & (x = 0) \\ 0 & (x \ne 0) \end{cases}, \qquad \int \delta(x)\,dx = 1.$$

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We can understand that 6(x) is the probability density function of the ran- 
dom variable X that is X = 0 almost surely. For 2 = (21, %2,...,¢V) € RN, 
the multi-dimensional delta function 6(x) is defined by 


Example 70. Assume a > 0. Then 
$$a\,\delta(ax + b) = \delta(x + b/a). \tag{10.1}$$

Let us show this equality. Let $f(x)$ be an arbitrary continuous function. By
the transform $y = ax + b$, $dy = a\,dx$ and

$$\int a\,\delta(ax + b) f(x)\,dx = \int \delta(y) f((y - b)/a)\,dy = f(-b/a).$$

On the other hand,
$$\int \delta(x + b/a) f(x)\,dx = f(-b/a).$$
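
The equality (10.1) can also be checked numerically by replacing the delta function with a narrow normal density. This is only an approximation used for illustration: the function names `smeared_delta` and `integrate`, the width `eps`, the test function $\cos x$, and the values of $a$ and $b$ are all assumptions of this sketch.

```python
import math

def smeared_delta(x, eps=1e-3):
    """Narrow normal density used as a stand-in for the delta function."""
    return math.exp(-0.5 * (x / eps) ** 2) / (eps * math.sqrt(2.0 * math.pi))

def integrate(g, lo, hi, steps=200000):
    """Midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    return sum(g(lo + (j + 0.5) * h) for j in range(steps)) * h

a, b = 2.0, 1.0
f = math.cos
lhs = integrate(lambda x: a * smeared_delta(a * x + b) * f(x), -3.0, 3.0)
rhs = integrate(lambda x: smeared_delta(x + b / a) * f(x), -3.0, 3.0)
print(lhs, rhs, f(-b / a))
```

Both integrals are expected to agree with $f(-b/a)$ up to an error that vanishes as the width `eps` tends to zero.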


10.2. Kullback-Leibler Distance 


Let $q(x)$ and $p(x)$ be probability density functions on $\mathbb{R}^N$. Then the
Kullback-Leibler distance or the relative entropy is defined by

$$D(q\|p) = \int q(x) \log \frac{q(x)}{p(x)}\,dx.$$

The Kullback-Leibler distance indicates the difference between $q(x)$ and $p(x)$
by the following lemma.



Lemma 26. Assume that $q(x)$ and $p(x)$ are continuous functions.
1. For arbitrary $q(x)$ and $p(x)$, $D(q\|p) \ge 0$.
2. $D(q\|p) = 0$ if and only if $q(x) = p(x)$ for all $x$ such that $q(x) > 0$.
Proof. Let us define a function $U(t)$ for $t > 0$ by
$$U(t) = 1/t + \log t - 1.$$
Then
$$D(q\|p) = \int q(x)\, U\Bigl(\frac{q(x)}{p(x)}\Bigr) dx.$$

Note that $U(t) \ge 0$ for arbitrary $t > 0$. Since $q(x) \ge 0$, the first half of
the lemma was proved. Let us prove the second half. If $q(x) = p(x)$ for
arbitrary $x$ such that $q(x) > 0$, then $D(q\|p) = 0$. Assume $D(q\|p) = 0$. $U(t)$
is a continuous function of $t$, hence $U(q(x)/p(x))$ is a continuous function
of $x$. Thus $U(q(x)/p(x)) = 0$ for arbitrary $x$ such that $q(x) > 0$. Hence
$q(x) = p(x)$, because $U(t) = 0$ is equivalent to $t = 1$. $\Box$
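
The two facts about $U(t)$ used in the proof can be verified by a short calculation. The following worked steps are a sketch added for the reader; the remark about the set $\{q > 0\}$ is an assumption made explicit here, namely that the integral form with $U$ equals $D(q\|p)$ exactly when $p(x)$ has no mass outside that set.

```latex
% Step 1: U attains its minimum value 0 only at t = 1.
U'(t) = -\frac{1}{t^{2}} + \frac{1}{t} = \frac{t-1}{t^{2}},
\qquad U(1) = 0
\;\Longrightarrow\;
U(t) \ge 0 \ (t > 0), \ \text{with equality only at } t = 1 .

% Step 2: relation between the U-form and D(q||p).
\int q(x)\, U\!\left(\frac{q(x)}{p(x)}\right) dx
 = \int_{q>0} p(x)\,dx + \int q(x)\log\frac{q(x)}{p(x)}\,dx - \int q(x)\,dx
 = D(q\|p) - \Bigl(1 - \int_{q>0} p(x)\,dx\Bigr).

% Since 1 - \int_{q>0} p(x) dx \ge 0, it follows that
% D(q||p) \ge \int q(x) U(q(x)/p(x)) dx \ge 0,
% with equality in the first inequality when p(x) = 0 outside \{q > 0\}.
```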


The log loss function of $p(x)$ is defined by

$$L(p) = -\int q(x) \log p(x)\,dx.$$

Then by the definition of Kullback-Leibler distance,

$$L(p) = \int q(x) \log(q(x)/p(x))\,dx - \int q(x) \log q(x)\,dx = D(q\|p) + S. \tag{10.2}$$

In this equation, $S$ is the entropy of $q(x)$ which does not depend on $p(x)$
and $D(q\|p) \ge 0$. That is to say,

$$L(p) \text{ is small} \iff D(q\|p) \text{ is small}. \tag{10.3}$$

In other words, minimization of KL distance is equivalent to minimization
of $L(p)$.


Remark 76. (Calculation of generalization error) As is shown in the above
proof,

$$\int q(x) \log \frac{q(x)}{p(x)}\,dx = \int q(x)\, U\Bigl(\frac{q(x)}{p(x)}\Bigr) dx,$$

where $U(t) \ge 0$. There are two different ways to approximate the
Kullback-Leibler distance. Let $\{X_i\}$ be a set of random variables which are
independently subject to the probability density function $q(x)$. Then $D(q\|p)$
can be approximated by

$$D_1 \approx \frac{1}{n} \sum_{i=1}^n \log \frac{q(X_i)}{p(X_i)},$$
$$D_2 \approx \frac{1}{n} \sum_{i=1}^n U\Bigl(\frac{q(X_i)}{p(X_i)}\Bigr).$$

Sometimes the variance of $D_2$ is smaller than that of $D_1$.
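
The two approximations can be compared in a case where the Kullback-Leibler distance is known. The following sketch is an illustration under stated assumptions: the function name `kl_estimators` is hypothetical, and the pair $q = N(0, 1)$, $p = N(\mu, 1)$ with $\mu = 0.5$ is chosen only because $D(q\|p) = \mu^2/2$ in that case.

```python
import math
import random

def kl_estimators(mu=0.5, n=20000, seed=0):
    """Sample averages of log(q/p) and U(q/p) for q = N(0,1), p = N(mu,1)."""
    rng = random.Random(seed)
    log_terms, u_terms = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                # X_i is drawn from q
        log_ratio = -mu * x + 0.5 * mu * mu    # log q(x)/p(x)
        log_terms.append(log_ratio)
        # U(t) = 1/t + log t - 1 evaluated at t = q(x)/p(x)
        u_terms.append(math.exp(-log_ratio) + log_ratio - 1.0)

    def mean_var(values):
        m = sum(values) / len(values)
        return m, sum((v - m) ** 2 for v in values) / len(values)

    return mean_var(log_terms), mean_var(u_terms)

(d1, var1), (d2, var2) = kl_estimators()
print("log-ratio average:", d1, "per-sample variance:", var1)
print("U average:        ", d2, "per-sample variance:", var2)
```

Both averages estimate the same value $\mu^2/2 = 0.125$, and in this particular setting the per-sample variance of the $U$ form is the smaller one, since every term of that average is nonnegative.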
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10.3. Probability Space 
Definition 30. (Metric space) Let $\Omega$ be a set. A function $D$
$$D : \Omega \times \Omega \ni (x, y) \mapsto D(x, y) \in \mathbb{R}$$

is called a metric if it satisfies the following three conditions.

(1) For arbitrary $x, y \in \Omega$, $D(x, y) = D(y, x) \ge 0$.

(2) $D(x, y) = 0$ if and only if $x = y$.

(3) For arbitrary $x, y, z \in \Omega$, $D(x, y) + D(y, z) \ge D(x, z)$.

A set $\Omega$ with a metric is called a metric space. The set of open neighborhoods
of a point $x \in \Omega$ is defined by $\{U_\epsilon(x); \epsilon > 0\}$ where

$$U_\epsilon(x) = \{y \in \Omega; D(x, y) < \epsilon\}.$$
A metric space $\Omega$ is called separable if there exists a countable and dense
subset $\{x_i; i = 1, 2, 3, \ldots\}$. A sequence $\{x_i; i = 1, 2, 3, \ldots\}$ is said to be a
Cauchy sequence if, for arbitrary $\delta > 0$, there exists $M$ such that

$$i, j \ge M \;\Longrightarrow\; D(x_i, x_j) < \delta.$$

If any Cauchy sequence in a metric space $\Omega$ converges in $\Omega$, then $\Omega$ is called
a complete metric space.


Example 71. (1) Finite dimensional real Euclidean space $\mathbb{R}^d$ is a separable
and complete metric space with the metric

$$D(x, y) = |x - y| = \Bigl(\sum_{i=1}^d (x_i - y_i)^2\Bigr)^{1/2},$$

where $x = (x_i)$, $y = (y_i)$, and $|\cdot|$ is a norm of $\mathbb{R}^d$.


(2) Let $K$ be a compact subset in $\mathbb{R}^d$. The set of all continuous functions
from $K$ to $\mathbb{R}^{d'}$
$$\Omega = \{f; f : K \to \mathbb{R}^{d'}\}$$

is a metric space with the metric
$$D(f, g) = \|f - g\| = \max_{x \in K} |f(x) - g(x)|,$$

where $|\cdot|$ is the norm of $\mathbb{R}^{d'}$. By the compactness of $K$ in $\mathbb{R}^d$, it is proved
that $\Omega$ is a complete and separable metric space.
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Definition 31. (Probability space) Let Q be a separable and complete 
metric space. A set $\mathcal{B}$ made of subsets contained in $\Omega$ is called a sigma
algebra or a completely additive family if it satisfies the following conditions.
(1) If $A_1, A_2 \in \mathcal{B}$, then $A_1 \cap A_2 \in \mathcal{B}$.

(2) If $A \in \mathcal{B}$, then $A^c \in \mathcal{B}$ ($A^c$ is the complementary set of $A$).

(3) If $A_1, A_2, A_3, \ldots \in \mathcal{B}$, then the countable union $\cup_{k=1}^\infty A_k \in \mathcal{B}$.

A pair of a metric space and a sigma algebra $(\Omega, \mathcal{B})$ is called a measurable
space. A function $P$

$$P : \mathcal{B} \ni A \mapsto 0 \le P(A) \le 1$$

is called a probability measure if it satisfies
(1) $P(\Omega) = 1$.

(2) For $\{B_k\}$ which satisfies $B_k \cap B_{k'} = \emptyset$ $(k \ne k')$, $P(\cup_{k=1}^\infty B_k) = \sum_{k=1}^\infty P(B_k)$.

A triple of a metric space, a sigma algebra, and a probability measure
$(\Omega, \mathcal{B}, P)$ is called a probability space. If $(\Omega, \mathcal{B}, P)$ satisfies the condition
that an arbitrary subset of a measure zero set is contained in $\mathcal{B}$, then it
is called a complete probability space. In this book, we assume that the
probability space is complete.


Remark 77. Any probability space can be made complete by extending the
sigma algebra and the probability measure so that any subset contained in
a measure zero set belongs to the extended algebra. The smallest sigma
algebra that contains all open subsets of $\Omega$ is called the Borel field. In
general the Borel field is not complete but it can be made complete by such a
completion procedure.


Remark 78. Let $(\mathbb{R}^N, \mathcal{B}, P)$ be a probability space, where $\mathbb{R}^N$ is the
$N$-dimensional real Euclidean space, $\mathcal{B}$ the completion of the Borel field, and $P$
a probability distribution. If $P$ is defined by a function $p(x) \ge 0$,

$$P(A) = \int_A p(x)\,dx \quad (A \in \mathcal{B}),$$

then $p(x)$ is called a probability density function.


Definition 32. (Random variable) Let $(\Omega, \mathcal{B}, P)$ be a complete probability
space and $(\Omega_1, \mathcal{B}_1)$ a measurable space. A function

$$X : \Omega \ni \omega \mapsto X(\omega) \in \Omega_1$$

is said to be measurable if $X^{-1}(B_1) \in \mathcal{B}$ for arbitrary $B_1 \in \mathcal{B}_1$. A measurable
function $X$ on a probability space is called a random variable. Sometimes
$X$ is said to be an $\Omega_1$-valued random variable. By the definition

$$\mu(B_1) = P(X^{-1}(B_1)), \tag{10.4}$$

$\mu$ is a probability measure on $(\Omega_1, \mathcal{B}_1)$, hence $(\Omega_1, \mathcal{B}_1, \mu)$ is a probability
space. The probability measure $\mu$ is called a probability distribution of the
random variable $X$. Then $X$ is said to be subject to $\mu$. Note that $\mu$ is the
probability distribution on the image space of the function $X$. The equation
(10.4) can be rewritten as

$$\int_{B_1} \mu(dx) = \int_{X^{-1}(B_1)} P(d\omega).$$


Remark 79. (1) In probability theory, the simplified notation
$$P(f(X) > 0) = P(\{\omega \in \Omega; f(X(\omega)) > 0\})$$

is often used. Then by definition, $P(f(X) > 0) = \mu(\{x \in \Omega_1; f(x) > 0\})$.
(2) In descriptions of definitions and theorems, sometimes we need only the
information of the image space of a random variable $X$ and the probability
distribution $P_X$. In other words, there are some definitions and theorems
in which the explicit statement of the probability space $(\Omega, \mathcal{B}, P)$ is not
needed. In such cases, the explicit definition of the probability space is
omitted, resulting in a statement such as "for an $\Omega_1$-valued random variable
$X$ which is subject to a probability distribution $P_X$, the following
equality holds..."


Definition 33. (Expected value) Let $X$ be a random variable from the
probability space $(\Omega, \mathcal{B}, P)$ to $(\Omega_1, \mathcal{B}_1)$ which is subject to the probability
distribution $P_X$. If the integration

$$E[X] = \int X(\omega) P(d\omega) = \int x\, P_X(dx)$$

is well defined and finite in $\Omega_1$, $E[X] \in \Omega_1$ is called the expected value, the
average, or the mean of $X$. Let $S$ be a subset of $\Omega_1$. The expected value
with restriction $S$ is defined by

$$E[X]_{\{X \in S\}} = \int_{X(\omega) \in S} X(\omega) P(d\omega) = \int_{x \in S} x\, P_X(dx).$$
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Remark 80. The following are elemental remarks. 

(1) Let $(\Omega_1, \mathcal{B}_1)$ and $X$ be the same as in Definition 33
and $(\Omega_2, \mathcal{B}_2)$ be a measurable space. If
$f : \Omega_1 \to \Omega_2$ is a measurable function, $f(X)$ is a random
variable on $(\Omega, \mathcal{B}, P)$. The expected value of $f(X)$ is equal
to

\[ E[f(X)] = \int f(X(\omega)) P(d\omega) = \int f(x) P_X(dx). \]

This expected value is often denoted by $E_X[f(X)]$.

(2) Two random variables which have the same probability distribution have
the same expected value. Hence if $X$ and $Y$ have the same probability
distribution, we can predict $E[Y]$ based on the information of $E[X]$.
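
The identity in Remark 80 (1) and the restricted expected value of Definition 33 can be checked on a finite probability space. The following Python sketch is an illustration and not part of the original argument; the sample space `omega` (two fair dice), the random variable `X` (the sum of the faces), the function `f`, and the set `S` are all assumptions chosen for the example.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Hypothetical finite probability space: two fair dice, P uniform on Omega.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def X(w):
    """Random variable X: Omega -> Omega_1 = {2, ..., 12} (sum of the faces)."""
    return w[0] + w[1]

def f(x):
    """An illustrative measurable function on Omega_1."""
    return (x - 7) ** 2

# Probability distribution P_X on the image space: P_X(A) = P(X^{-1}(A)).
P_X = defaultdict(Fraction)
for w, p in P.items():
    P_X[X(w)] += p

# E[f(X)] computed on Omega with P, and on Omega_1 with P_X.
lhs = sum(f(X(w)) * p for w, p in P.items())
rhs = sum(f(x) * p for x, p in P_X.items())

# Expected value of X with restriction S = {10, 11, 12}.
S = {10, 11, 12}
restricted = sum(x * p for x, p in P_X.items() if x in S)

print(lhs, rhs, restricted)  # prints: 35/6 35/6 16/9
```

Exact rational arithmetic is used so that both integrals agree exactly rather than up to rounding.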


Definition 34. (Convergence of random variables) Let $\{X_n\}$ and $X$ be a
sequence of random variables and a random variable on a probability space
$(\Omega, \mathcal{B}, P)$, respectively.

(1) It is said that $X_n \to X$ almost surely (almost everywhere), if

\[ P(\{\omega \in \Omega ;\, \lim_{n \to \infty} X_n(\omega) = X(\omega)\}) = 1. \]

(2) It is said that $X_n \to X$ in the mean of order $p > 0$, if
$E[|X_n|^p] < \infty$ and if

\[ \lim_{n \to \infty} E[|X_n - X|^p] = 0. \]

(3) It is said that $X_n \to X$ in probability, if

\[ \lim_{n \to \infty} P(D(X_n, X) > \varepsilon) = 0 \]

for an arbitrary $\varepsilon > 0$, where $D(\cdot, \cdot)$ is the metric of
the image space of $X$.

(4) It is said that $X_n \to X$ in distribution or in law, if

\[ \lim_{n \to \infty} E[F(X_n)] = E[F(X)] \]

for an arbitrary bounded and continuous function $F$.
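
Definition 34 (4) can be illustrated by a deterministic calculation. The following Python sketch is an illustration under stated assumptions, not taken from the original text: $X_n$ is assumed to be uniformly distributed on the grid $\{0, 1/n, \ldots, (n-1)/n\}$, $X$ is uniform on $[0, 1]$, and the single bounded continuous test function is $F = \cos$. The helper name `expect_F_Xn` is hypothetical.

```python
import math

def expect_F_Xn(n, F=math.cos):
    """E[F(X_n)] for X_n uniform on the grid {0, 1/n, ..., (n-1)/n}."""
    return sum(F(j / n) for j in range(n)) / n

# E[F(X)] for X uniform on [0, 1] and F = cos is the integral of cos on [0, 1].
limit = math.sin(1.0)

for n in (1, 10, 100, 1000):
    value = expect_F_Xn(n)
    print(n, value, abs(value - limit))
```

Checking one function $F$ does not prove convergence in distribution; the sketch only shows the kind of limit that the definition requires for every bounded continuous $F$.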


Definition 35. (1) It is said that $\{X_n\}$ is uniformly tight or bounded in
probability, if

\[ \lim_{M \to \infty} \sup_n P(|X_n| > M) = 0. \]

(2) It is said that $\{X_n\}$ is asymptotically uniformly integrable, if

\[ \lim_{M \to \infty} \sup_n E[|X_n|]_{\{|X_n| > M\}} = 0. \]

If $\{X_n\}$ is asymptotically uniformly integrable, then it is uniformly
tight.


The following mathematical relations are important and useful.


Lemma 27. (Relation between several convergences)

(1) If $X_n \to X$ almost surely, then $X_n \to X$ in probability.

(2) If $X_n \to X$ in the mean of order $p > 0$, then $X_n \to X$ in
probability.

(3) If $X_n \to X$ in the mean of order $p > 1$ and $p > q \geq 1$, then
$X_n \to X$ in the mean of order $q$.

(4) If $X_n \to X$ in probability, then $X_n \to X$ in distribution.

(5) If $X_n \to a$ in distribution where $a$ is a constant, then $X_n \to a$
in probability.

(6) If $X_n \to X$ in distribution, then $\{X_n\}$ is uniformly tight.

(7) If $\{X_n\}$ is uniformly tight, then there exists a subsequence of
$\{X_n\}$ which converges in distribution.


In probability theory, many theorems on relations between random variables
are derived. The following are the main results which we use in this book.


Lemma 28. (Continuous mapping theorem) Assume that $f$ is a continuous
function from a metric space to a metric space. Then by the continuous
mapping theorem, the following hold.

(1) If $X_n \to X$ almost surely, then $f(X_n) \to f(X)$ almost surely.

(2) If $X_n \to X$ in probability, then $f(X_n) \to f(X)$ in probability.

(3) If $X_n \to X$ in distribution, then $f(X_n) \to f(X)$ in distribution.

Note that these results hold even if $f$ has discontinuity points, provided
that the set of discontinuity points of $f$ has probability zero under the
distribution of $X$.


Lemma 29. (Convergences of synthesized random variables)

(1) If both $X_n \to X$ and $Y_n \to Y$ almost surely, then both
$X_n + Y_n \to X + Y$ and $X_n Y_n \to XY$ almost surely.

(2) If both $X_n \to X$ and $Y_n \to Y$ in the mean of order $p$, then
$X_n + Y_n \to X + Y$ in the mean of order $p$.

(3) If both $X_n \to X$ and $Y_n \to Y$ in probability, then both
$X_n + Y_n \to X + Y$ and $X_n Y_n \to XY$ in probability.

(4) If both $X_n \to X$ and $Y_n \to a$ in distribution where $a$ is a
constant, then $X_n + Y_n \to X + a$ and $X_n Y_n \to aX$ in distribution.
Moreover if $a \neq 0$, then $X_n / Y_n \to X / a$ in distribution.


Remark 81. Even if both $X_n \to X$ and $Y_n \to Y$ in distribution, in
general neither $X_n + Y_n \to X + Y$ nor $X_n Y_n \to XY$ holds in
distribution.
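
A simple counterexample makes Remark 81 concrete. The following Python sketch is an illustration and not part of the original text. It assumes $X \sim N(0, 1)$, $X_n = X$ and $Y_n = -X$ for every $n$. Since $-X$ has the same distribution as $X$, $Y_n \to Y$ in distribution holds with $Y = X$, yet $X_n + Y_n = 0$ while $X + Y = 2X$.

```python
import random
import statistics

random.seed(0)

# Samples of X ~ N(0, 1); X_n = X and Y_n = -X for every n (assumption).
xs = [random.gauss(0.0, 1.0) for _ in range(20000)]

sum_n = [x + (-x) for x in xs]     # samples of X_n + Y_n, identically zero
sum_limit = [x + x for x in xs]    # samples of X + Y with the choice Y = X

var_n = statistics.pvariance(sum_n)          # variance of X_n + Y_n
var_limit = statistics.pvariance(sum_limit)  # variance of X + Y, near 4

print(var_n, var_limit)
```

The two variances differ, so the distribution of $X_n + Y_n$ cannot converge to that of $X + Y$. Convergence in distribution constrains only the marginal laws and says nothing about the joint law of $(X_n, Y_n)$.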


Lemma 30. (Convergences of expected values) In order to prove the
convergence of expected values, the following theorems are employed.

(1) If $X_n \to X$ in the mean of order $p$, then $E[X^p] < \infty$ and
$E[(X_n)^p] \to E[X^p]$.

(2) If $X_n \to X$ almost surely, and if there exists a random variable $Y$
such that $|X_n| \leq Y$ and $E[Y] < \infty$, then $E[X_n] \to E[X]$.

(3) Assume that $E[X_n]$ and $E[X]$ are finite. If $X_n \to X$ in
distribution and if $\{X_n\}$ is asymptotically uniformly integrable, then
$E[X_n] \to E[X]$.

(4) If $\sup_n E[|X_n|^{1+\varepsilon}] < \infty$ for some $\varepsilon > 0$,
then $\{X_n\}$ is asymptotically uniformly integrable.
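
The role of asymptotic uniform integrability in Lemma 30 (3) can be seen in a standard counterexample. The following Python sketch is an illustration and not from the original text. It assumes $X_n = n$ with probability $1/n$ and $X_n = 0$ otherwise, so that $X_n \to 0$ in probability (hence in distribution) while $E[X_n] = 1$ for every $n$. The helper names are hypothetical.

```python
from fractions import Fraction

def dist(n):
    """Distribution of X_n as a dict value -> probability."""
    d = {n: Fraction(1, n)}
    if n > 1:
        d[0] = 1 - Fraction(1, n)
    return d

def mean(n):
    """E[X_n]."""
    return sum(x * p for x, p in dist(n).items())

def tail_mean(n, M):
    """E[|X_n|] with restriction {|X_n| > M}."""
    return sum(abs(x) * p for x, p in dist(n).items() if abs(x) > M)

def prob_far_from_zero(n, eps):
    """P(|X_n - 0| > eps)."""
    return sum(p for x, p in dist(n).items() if abs(x) > eps)

ns = [1, 10, 100, 1000]
print([mean(n) for n in ns])                                # stays at 1
print([prob_far_from_zero(n, Fraction(1, 2)) for n in ns])  # tends to 0
print(max(tail_mean(n, 50) for n in ns))                    # stays at 1
```

For every $M$ there is an $n > M$ with tail mean equal to one, so the condition of Definition 35 (2) fails and the expected values do not converge to $E[X] = 0$.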


For a short description, we adopt the following definition. 


Definition 36. Let $k > 0$ be a positive real value and let $X$ and $\{X_n\}$
be random variables. It is said that $\{X_n\}$ satisfies the asymptotic
expectation condition with index $k$, if the convergence in distribution
$X_n \to X$ holds and if there exists $\varepsilon > 0$ such that
$E[|X|^{k+\varepsilon}] < \infty$ and

\[ \sup_n E[|X_n|^{k+\varepsilon}] < \infty. \]

If $\{X_n\}$ satisfies the asymptotic expectation condition with index $k$,
then $E[(X_n)^k] \to E[X^k]$.


The following notations are often used in statistics. 


Definition 37. Let $\{a_n ;\, a_n > 0\}$ and $\{x_n\}$ be sequences of real
values.
(1) The notation

\[ x_n = o(a_n) \]

means that $x_n / a_n \to 0$.
(2) The notation

\[ x_n = O(a_n) \]

means that $\sup_n |x_n / a_n| < \infty$.


Definition 38. Let $\{a_n ;\, a_n > 0\}$ and $\{X_n\}$ be sequences of real
values and random variables respectively.
(1) The notation

\[ X_n = o_p(a_n) \]

means that $X_n / a_n \to 0$ in probability.
(2) The notation

\[ X_n = O_p(a_n) \]

means that $\{X_n / a_n\}$ is uniformly tight.


10.4 Empirical Process


In this section we explain empirical process theory which is necessary in
statistics. Let $q(x)$ be a probability density function on $\mathbb{R}^N$
and $f(x, w)$ be an $\mathbb{R}$-valued function of $x \in \mathbb{R}^N$ and
$w \in W \subset \mathbb{R}^d$ which satisfies

\[ \int f(x, w) q(x) dx = 0. \]

Let $X_1, X_2, \ldots, X_n$ be random variables which are independently
subject to the probability density function $q(x)$. The empirical process
$\xi_n(w)$ is defined by

\[ \xi_n(w) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} f(X_i, w). \tag{10.5} \]

We assume that each element of the covariance matrix

\[ S(w) = \int f(x, w) f(x, w)^T q(x) dx \]

is finite for an arbitrary $w \in W$. Let $\xi(w)$ ($w \in W$) be a random
process whose average and covariance matrix are zero and $S(w)$. By the
central limit theorem, for each $w$, $\xi_n(w)$ converges to $\xi(w)$ in
distribution. In statistics, we need stronger results than the convergence in
distribution $\xi_n(w) \to \xi(w)$ for each $w$. For example, convergences in
distribution

\[ \sup_{w \in W} \|\xi_n(w)\| \to \sup_{w \in W} \|\xi(w)\| \tag{10.6} \]

and

\[ \int_W \xi_n(w)^k dw \to \int_W \xi(w)^k dw \tag{10.7} \]

for some $k > 0$ are necessary in statistical theory. The empirical process
theory enables us to prove such convergences.
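
The objects defined above can be simulated directly. The following Python sketch is an illustration under stated assumptions, not the method of the original text: $q$ is taken to be the standard normal density on $\mathbb{R}$, $f(x, w) = \sin(wx)$ (odd in $x$, so its integral against $q$ is zero), and the set $W$ is replaced by a finite grid. For this choice $S(1) = E[\sin^2 X] = (1 - e^{-2})/2$, which the sample variance of $\xi_n(1)$ should approximate.

```python
import math
import random

random.seed(1)

def f(x, w):
    """Illustrative choice f(x, w) = sin(w x); its mean under N(0, 1) is 0."""
    return math.sin(w * x)

def xi_n(sample, w):
    """Empirical process (10.5): n^{-1/2} * sum_i f(X_i, w)."""
    return sum(f(x, w) for x in sample) / math.sqrt(len(sample))

n, reps = 100, 2000
grid = [0.5 * j for j in range(1, 7)]   # finite grid standing in for W

values_at_1 = []   # replications of xi_n(1.0)
sup_norms = []     # replications of the maximum over the grid of |xi_n(w)|
for _ in range(reps):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    path = [xi_n(sample, w) for w in grid]
    values_at_1.append(path[grid.index(1.0)])
    sup_norms.append(max(abs(v) for v in path))

# S(1) = E[sin(X)^2] = (1 - exp(-2)) / 2 for X ~ N(0, 1).
S_1 = (1.0 - math.exp(-2.0)) / 2.0
var_at_1 = sum(v * v for v in values_at_1) / reps

print(var_at_1, S_1, sum(sup_norms) / reps)
```

The list `sup_norms` holds draws of a grid approximation to the left-hand side of eq.(10.6). A grid maximum is only a lower bound for the supremum over $W$.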


Let $\|g\|_\infty$ be the supremum norm of a function $g(w)$,

\[ \|g\|_\infty = \sup_{w \in W} \|g(w)\|, \]

and $C(W)$ be a set of all continuous functions on a compact set $W$,

\[ C(W) = \{ g \,;\, g(w) \text{ is continuous on } W \}. \]

Then $C(W)$ is a complete and separable metric space with the norm
$\|\cdot\|_\infty$. Both functions

\[ g \mapsto \sup_{w \in W} \|g(w)\| \]

and

\[ g \mapsto \int_W g(w)^k dw \]

are continuous on $C(W)$. Hence in order to prove eq.(10.6) and eq.(10.7), it
is sufficient to prove that the convergence in distribution $\xi_n \to \xi$
on $C(W)$ holds.


There are several mathematically sufficient conditions which ensure that
$\xi_n \to \xi$ in distribution on $C(W)$ holds. The following are examples
of such sufficient conditions.

(1) If $f(x, w)$ is represented by

\[ f(x, w) = \sum_{j=1}^{\infty} c_j(w) f_j(x) \]

where $\sum_j |c_j(w)| < M$ for some $M$ and $E[f_j(X)] = 0$ and
$\sum_j E[|f_j(X)|^2] < \infty$, then $\xi_n \to \xi$ in distribution on
$C(W)$ holds.

(2) Assume that $E[f(X, w)] = 0$ and $E[|f(X, w)|^2] < \infty$ and
$w \in W \subset \mathbb{R}^d$. If $f(x, w)$ is sufficiently smooth, for
example,

\[ \max_{|k| \leq \ldots} \sup_{w \in W}
   \| \partial^k f(x, w) / \partial w^k \| < \infty \]

for an arbitrary $x$, where $|k| = k_1 + k_2 + \cdots + k_d$ for
$k = (k_1, k_2, \ldots, k_d)$, then $\xi_n \to \xi$ in distribution on $C(W)$
holds.


10.5 Convergence of Expected Values


Let $\xi_n(w)$ be an empirical process defined by eq.(10.5). We often need
the convergence

\[ E[F(\xi_n)] \to E_\xi[F(\xi)], \tag{10.8} \]

for a given function $F$ on $C(W)$, where $E[\ ]$ and $E_\xi[\ ]$ show the
expected values over $X_1, X_2, \ldots, X_n$ and $\xi$ respectively. If $F$
is a bounded and continuous function on $C(W)$, then eq.(10.8) is derived
from the definition of convergence in distribution $\xi_n \to \xi$. Even if
$F$ is unbounded, if $F$ is continuous and if there exists $\varepsilon > 0$
such that

\[ \sup_n E[|F(\xi_n)|^{1+\varepsilon}] < \infty, \tag{10.9} \]

then eq.(10.8) holds. If $\xi_n \to \xi$ in distribution on $C(W)$ and if
there exists $\varepsilon > 0$ such that

\[ \sup_n E\bigl[ \sup_{w \in W} \|\xi_n(w)\|^{k+\varepsilon} \bigr] < \infty,
   \tag{10.10} \]

then it is said that $\xi_n(w)$ satisfies the asymptotic expectation
condition with index $k$. If such a condition is satisfied, then

\[ E\bigl[ \sup_{w \in W} \|\xi_n(w)\|^k \bigr]
   \to E_\xi\bigl[ \sup_{w \in W} \|\xi(w)\|^k \bigr]. \tag{10.11} \]

The following lemma shows a sufficient condition for the asymptotic
expectation condition with index $k$.


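Lemma 31. Let $x \in \mathbb{R}^N$, $w \in W \subset \mathbb{R}^d$. Assume
that a function $f(x, w)$ is represented by

\[ f(x, w) = \sum_{j=1}^{\infty} c_j(w) f_j(x). \]

Let $k \geq 2$ be a positive integer. The sequences $\{c_j\}$ and $\{t_j\}$
are defined by

\[ c_j = \sup_{w \in W} |c_j(w)|, \qquad t_j = E[|f_j(X)|^{k+1}]. \]

If $E[f_j(X)] = 0$ ($j = 1, 2, 3, \ldots$), if $\xi_n(w) \to \xi(w)$ in
distribution on $C(W)$, and if

\[ \sum_j c_j < \infty, \qquad \sum_j c_j t_j < \infty, \]

then the sequence

\[ \sup_{w \in W} |\xi_n(w)|
   = \sup_{w \in W} \Bigl| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} f(X_i, w) \Bigr| \]

satisfies the asymptotic expectation condition with index $k$.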


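Proof. Let $\ell = k + 1$. Since $x^{\ell}$ ($\ell > 2$) is a convex function
of $x$, for arbitrary $\{a_j\}$ and $\{b_j\}$,

\[ \left( \frac{\sum_j |a_j| |b_j|}{\sum_j |a_j|} \right)^{\ell}
   \leq \frac{\sum_j |a_j| |b_j|^{\ell}}{\sum_j |a_j|}. \]

Hence

\[ \Bigl( \sum_j a_j b_j \Bigr)^{\ell}
   \leq \Bigl( \sum_j |a_j| \Bigr)^{\ell - 1}
        \Bigl( \sum_j |a_j| |b_j|^{\ell} \Bigr). \]

By using this inequality,

\begin{align*}
E\bigl[ \sup_{w \in W} |\xi_n(w)|^{\ell} \bigr]
 &= E\Bigl[ \sup_{w \in W} \Bigl| \sum_j c_j(w)
      \Bigl( \frac{1}{\sqrt{n}} \sum_{i=1}^{n} f_j(X_i) \Bigr)
      \Bigr|^{\ell} \Bigr] \\
 &\leq \Bigl( \sum_j c_j \Bigr)^{\ell - 1}
      \Bigl( \sum_j c_j \, E\Bigl[ \Bigl| \frac{1}{\sqrt{n}}
      \sum_{i=1}^{n} f_j(X_i) \Bigr|^{\ell} \Bigr] \Bigr) \\
 &\leq \Bigl( \sum_j c_j \Bigr)^{\ell - 1} \ell^{\ell - 1}
      \Bigl( \sum_j c_j \, E[|f_j(X)|^{\ell}] \Bigr) < \infty,
\end{align*}

where we used the following Lemma 32. □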


Lemma 32. Let $X_1, X_2, \ldots, X_n$ be independent random variables which
are subject to the same probability distribution. Let $k \geq 2$ be an
integer. Assume that

\[ E[|X_i|^k] < \infty, \qquad E[X_i] = 0. \]

Then

\[ Y_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i \]

satisfies that, for an arbitrary positive integer $n$,

\[ E[(Y_n)^k] \leq k^{k-1} E[|X|^k]. \]
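
The bound of Lemma 32 can be checked exactly in a simple case. The following Python sketch is an illustration and not part of the original proof. It assumes that the $X_i$ take the values $+1$ and $-1$ with probability $1/2$ each, so that $E[X_i] = 0$ and $E[|X|^4] = 1$, and it takes $k = 4$. The helper name `fourth_moment` is hypothetical.

```python
from fractions import Fraction
from math import comb

def fourth_moment(n):
    """Exact E[(Y_n)^4] for Y_n = n^{-1/2} (X_1 + ... + X_n), where the X_i
    are independent with P(X_i = 1) = P(X_i = -1) = 1/2 (an assumption)."""
    total = Fraction(0)
    for heads in range(n + 1):
        s = 2 * heads - n                        # value of X_1 + ... + X_n
        total += Fraction(comb(n, heads), 2 ** n) * s ** 4
    return total / n ** 2

k = 4
bound = k ** (k - 1) * 1    # k^{k-1} E[|X|^k], with E[|X|^4] = 1 here

for n in (1, 2, 10, 100):
    print(n, fourth_moment(n), bound)
```

In this case $E[(Y_n)^4] = 3 - 2/n$, which stays below the bound $4^3 = 64$ for every $n$. The point of the lemma is that the bound does not depend on $n$.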


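Proof. Let $i = \sqrt{-1}$ and

\[ \varphi(t) = E[\exp(itX/\sqrt{n})]. \]

Then

\[ E[(Y_n)^k] = (-i)^k \, (\varphi(t)^n)^{(k)}|_{t=0}, \]

where $(\varphi(t)^n)^{(k)} = (d/dt)^k (\varphi(t)^n)$. In order to prove
this lemma, it is sufficient to prove that, for an arbitrary integer $r$
($r = 1, 2, \ldots, k$),

\[ \bigl| (\varphi(t)^n)^{(r)}|_{t=0} \bigr| \leq r^{r-1} A^r, \tag{10.12} \]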

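where $A = E[|X|^k]^{1/k}$. Let us prove eq.(10.12) by mathematical
induction. For the case $r = 1$, eq.(10.12) holds. For $k$, assume that
eq.(10.12) holds for $1 \leq r \leq k - 1$. Then

\begin{align*}
(\varphi(t)^n)^{(k)}
 &= \bigl( n \varphi(t)^{n-1} \varphi(t)' \bigr)^{(k-1)} \\
 &= n \sum_{r=0}^{k-1} \binom{k-1}{r}
    \varphi(t)^{(k-r)} (\varphi(t)^{n-1})^{(r)}.
\end{align*}

Hence

\[ \bigl| (\varphi(t)^n)^{(k)}|_{t=0} \bigr|
   \leq n \sum_{r=0}^{k-1} \binom{k-1}{r}
        \bigl| \varphi(t)^{(k-r)}|_{t=0} \bigr| \,
        \bigl| (\varphi(t)^{n-1})^{(r)}|_{t=0} \bigr|. \]

By using $E[X] = 0$, the term with $r = k - 1$ vanishes, and

\[ \bigl| \varphi(t)^{(k-r)}|_{t=0} \bigr|
   = n^{-(k-r)/2} \bigl| E[X^{k-r}] \bigr|
   \leq n^{-(k-r)/2} A^{k-r} \leq \frac{A^{k-r}}{n} \]

for $r = 0, 1, 2, \ldots, k - 2$. By these inequalities and the assumptions
of the mathematical induction,

\[ \bigl| (\varphi(t)^n)^{(k)}|_{t=0} \bigr|
   \leq A^k \sum_{r=0}^{k-1} \binom{k-1}{r} (k-1)^r = A^k k^{k-1}, \]

where we used $r^{r-1} \leq (k-1)^{r-1}$ for $r = 1, 2, \ldots, k - 1$. □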


10.6 Mixture by Dirichlet Process 


In this section, we introduce an infinite mixture using the Dirichlet
process. Firstly, the finite mixture of normal distributions is defined as
follows. In this section, if a random variable $X$ is subject to a
probability distribution $P$, it is denoted by

\[ X \sim P. \]

Let $N(x|b)$ be a normal distribution of $x \in \mathbb{R}^M$ whose average
is $b \in \mathbb{R}^M$,

\[ N(x|b) = \frac{1}{(2\pi)^{M/2}}
   \exp\Bigl( -\frac{\|x - b\|^2}{2} \Bigr). \]


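We first represent the normal mixture defined by eq.(2.28) as a
sample-generating model. For a finite positive integer $K$, let
$a = (a_1, a_2, \ldots, a_K)$ be a parameter which satisfies
$\sum_k a_k = 1$ and $a_k \geq 0$, and $b_k \in \mathbb{R}^M$. The Dirichlet
distribution with index $\beta = \{\beta_k\}$ is given, up to the
normalizing constant, by

\[ \mathrm{Dir}(a|\beta) \propto \prod_{k=1}^{K} (a_k)^{\beta_k - 1}. \]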




Let $\varphi(b_k)$ be some prior of $b_k$. The normal mixture is represented
as a generating model of $(X_1, X_2, \ldots, X_n)$,

\begin{align*}
a &\sim \mathrm{Dir}(a|\beta), \\
k &\sim \mathrm{Multi}(a), \\
b_k &\sim \varphi(b_k), \\
X_i &\sim N(x|b_k),
\end{align*}

where $k \sim \mathrm{Multi}(a)$ shows that an integer $k$
($1 \leq k \leq K$) is subject to the one-time multinomial distribution with
probability $a = (a_1, a_2, \ldots, a_K)$.
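
The generating model above can be written as a short sampling routine. The following Python sketch is a minimal illustration under stated assumptions and is not the implementation of the original text. It takes $M = 1$, uses a normal distribution with standard deviation `prior_sd` as a stand-in for the unspecified prior of $b_k$, and draws the Dirichlet variable by normalizing independent gamma variables. The function names are hypothetical.

```python
import random

random.seed(2)

def sample_dirichlet(beta):
    """Draw a ~ Dir(a | beta) by normalizing independent Gamma(beta_k, 1) draws."""
    g = [random.gammavariate(b, 1.0) for b in beta]
    total = sum(g)
    return [v / total for v in g]

def sample_mixture(n, beta, prior_sd=3.0):
    """Generate X_1, ..., X_n from the finite normal mixture with M = 1."""
    K = len(beta)
    a = sample_dirichlet(beta)                           # a ~ Dir(a | beta)
    b = [random.gauss(0.0, prior_sd) for _ in range(K)]  # b_k ~ prior (assumed normal)
    xs, labels = [], []
    for _ in range(n):
        k = random.choices(range(K), weights=a)[0]       # k ~ Multi(a)
        xs.append(random.gauss(b[k], 1.0))               # X_i ~ N(x | b_k)
        labels.append(k)
    return a, b, xs, labels

a, b, xs, labels = sample_mixture(n=500, beta=[1.0, 1.0, 1.0])
print(a, b, xs[:5])
```

The component label $k$ is drawn afresh for every observation, whereas $a$ and the $b_k$ are drawn once and shared by all observations.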


Definition 39. (Dirichlet process) Let $(\mathbb{R}^M, \mathcal{B})$ be a
measurable space of $\mathbb{R}^M$, $\alpha > 0$ be a constant, and $G_0$ be
a measure on $(\mathbb{R}^M, \mathcal{B})$. A family of measurable sets
$\{B_j \in \mathcal{B} ;\, j = 1, 2, \ldots, m\}$ which satisfies

\[ B_j \cap B_k = \emptyset \ (j \neq k), \qquad \cup_j B_j = \mathbb{R}^M, \]

is called a disjoint partition of $\mathbb{R}^M$. A probability
measure-valued random variable $G$ is said to be subject to the Dirichlet
process if, for an arbitrary disjoint partition,

\[ (G(B_1), G(B_2), \ldots, G(B_m))
   \sim \mathrm{Dir}(a|\alpha G_0(B_1), \alpha G_0(B_2), \ldots,
   \alpha G_0(B_m)). \]

The probability distribution of $G$ is denoted by $\mathrm{DP}(\alpha, G_0)$.


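Then the infinite mixture model, which is formally given by $K \to \infty$
and $\beta_k = \alpha/K$, is represented by

\begin{align*}
G &\sim \mathrm{DP}(\alpha, \varphi), \\
b_i &\sim G, \\
X_i &\sim N(x|b_i).
\end{align*}

It is known that a random measure drawn from the Dirichlet process is, with
probability one, a discrete sum of point masses.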
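
The formal limit $K \to \infty$ with $\beta_k = \alpha/K$ can be explored numerically for finite $K$. The following Python sketch is an illustration under stated assumptions, not a sampler for the Dirichlet process itself. It draws the mixing weights from the symmetric Dirichlet distribution with index $\alpha/K$ and counts how many components are needed to cover 99% of the total weight. The function names and the threshold are hypothetical.

```python
import random

random.seed(3)

def symmetric_dirichlet_weights(alpha, K):
    """Draw a ~ Dir(alpha/K, ..., alpha/K) via normalized Gamma draws."""
    g = [random.gammavariate(alpha / K, 1.0) for _ in range(K)]
    total = sum(g)
    return [v / total for v in g]

def components_for_mass(weights, mass=0.99):
    """Smallest number of largest weights whose sum reaches `mass`."""
    acc = 0.0
    for count, w in enumerate(sorted(weights, reverse=True), start=1):
        acc += w
        if acc >= mass:
            return count
    return len(weights)

alpha = 2.0
for K in (10, 100, 1000):
    a = symmetric_dirichlet_weights(alpha, K)
    print(K, components_for_mass(a))
```

For large $K$ one typically observes that a small number of components carries almost all of the weight, which is in line with the statement that the limiting random measure is discrete.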

